Abstract

Distributing simulations among multiple processors is one approach to reducing
VHDL simulation time for large VLSI circuit designs. However, parallel simulation
introduces the problem of how to partition the logic gates and system behaviors among
the available processors in order to obtain maximum speedup. This research investigates
deliberate partitioning algorithms that account for the complex inter-dependency structure
of the circuit behaviors. Once an initial partition has been obtained, a border annealing
algorithm is used to iteratively improve the partition. In addition, methods of measuring
the cost of a partition and relating it to the resulting simulation performance are
investigated. Structural circuits ranging from one thousand to over four thousand
behaviors are simulated. The deliberate partitions consistently provided superior speedup
to a random distribution of the circuit behaviors.

PARTITIONING STRUCTURAL VHDL CIRCUITS
FOR PARALLEL EXECUTION
ON HYPERCUBES

I. Introduction

1.1  Background 

Modern integrated circuit designs are rapidly growing larger and more complex, with
chip transistor counts increasing by approximately 25% per year, doubling every three
years (13:17). In order to reduce chip costs and turnaround times, designers use
sophisticated simulation tools to validate their designs prior to chip fabrication (4:1). The
Department of Defense (DOD) established the Very High Speed Integrated Circuit
(VHSIC) program in 1979 with the primary objective of advancing the state of the art in
the areas of large scale circuit design and manufacturing technology (15:1). As part of
this program, the VHSIC Hardware Description Language (VHDL) program began in
1981 with the goal of developing a standard simulation language for the support of
hardware design (15, 4).

As the size and complexity of circuit designs continue their upward trend, there is a
growing need to increase the speed of the VHDL simulations. Slow sequential
simulations result in a longer iterative design process and increase the cost of the final
product. In an effort to achieve the desired performance improvement, the Advanced
Research Projects Agency (ARPA) has sponsored the QUEST project with the objective
of obtaining a thousand-fold speedup in large VHDL simulations using the commercial
Intermetrics VHDL simulator running on a VAX 8650 as the baseline (6:2, 19:1-1).

Previous AFIT research has investigated the possibility of achieving speedup by
distributing the VHDL simulations over multiple processors for parallel execution. By
effectively sharing the simulation workload among multiple processors, simulations of
complex chip designs can be run faster, resulting in a more efficient and cost-effective
design cycle.

AFIT research in 1991-92 focused on the internal data structures used in the
Intermetrics commercial VHDL simulator, which runs in a sequential mode. A method of
intercepting the intermediate C source code from the Intermetrics compiler, transforming
it into parallel models, and executing the transformed code in parallel with correct results
on Intel iPSC/2 and iPSC/860 hypercubes has been demonstrated (8, 4). The result of this
research is a parallel VHDL simulator, referred to as VSIM, that implements a selected
subset of the standard VHDL language (4).

Breeden’s results demonstrated that speedup on multiple processors can be achieved
under limited circumstances using a random partitioning of the VHDL behaviors^ among
the processors of the hypercube (4). In his random partitioning approach, the objective
was to randomly assign an equal number of behaviors to each processor without
considering their complex inter-dependency relationships. As a result, the speedup results
were significantly less than optimal, and fell off rapidly as the number of processors was
increased due to increases in communications overhead. This thesis research effort
focuses on the development of efficient and effective partitioning strategies to map the
logical VHDL behaviors to the physical processors of a hypercube in order to take
maximum advantage of the parallelism available in the simulation application.


^ A behavior is an executable VHDL process representing a logic gate, source signal, or other simple
VHDL process.

Figure 1. Speedup Curve for Wallace Tree with Random Partitioning

1.2 Problem Statement

AFIT’s parallel VHDL simulator, VSIM, has been validated on circuits as large as an
8x8 Wallace Tree Multiplier, containing over 1000 VHDL behaviors, on both an 8-node
iPSC/2 and an 8-node iPSC/860 hypercube (4). However, in order to maximize the
benefits of parallelization, a deliberate partitioning strategy is required that takes into
account the complex inter-dependency relationships of the VHDL behaviors when
mapping them onto the physical processors of the parallel system. Otherwise, the
communications overhead required to maintain synchronization among the processors
will negate the potential speedup benefits. For example, Figure 1 shows how the speedup
curve for the Wallace Tree Multiplier on the iPSC/2 with a random partitioning of the
behaviors takes a downward turn as the number of processors is increased past four.

1.3 Research Objectives

The primary objective of this thesis is to demonstrate improved speedup over random
partitioning in the simulation of medium to large sized VHDL circuits using the VSIM
parallel simulator. This will be accomplished through the use of a deliberate partitioning
strategy. Specific research goals include:

• Developing an efficient and effective partitioning strategy that accounts for the
complex inter-dependency structure of the VHDL circuit being simulated.

• Investigating methods of computing the cost of a partition.

• Quantifying the relationship between the cost of a partition and the resulting
performance of the simulation.

• Demonstrating improved speedup over a random partitioning using a variety of
VHDL circuits.

1.4  Assumptions 

The  research  by  Comeau  provided  the  foundation  for  the  transformation  of 
Intermetrics  VHDL  models  into  models  that  can  be  executed  in  a  parallel  environment 
(8). Breeden built upon this work, automating the transformation process, and validating
the  results  of  the  parallel  simulator  VSIM  (4).  Building  upon  their  research,  the  following 
assumptions  are  made  in  this  thesis: 

•  The  subset  of  the  standard  VHDL  implemented  by  VSIM,  as  described  in  (4),  is 
adequate to demonstrate the feasibility and effectiveness of various partitioning
strategies. 

• The commercial Intermetrics VHDL compiler, version 2.1, September 1990, will
be  used  to  provide  the  sequential  VHDL  models  (4). 

• The conservative Chandy-Misra algorithm for parallel discrete event simulation
(PDES) is used to maintain synchronization between the processes of the parallel
system. Using the SPECTRUM^ testbed, the null-message protocol is used to
provide deadlock avoidance (4). To maintain consistency with the AFIT
simulation environment, this synchronization protocol will not be significantly
altered for this thesis.

^ Simulation Protocol Evaluation on a Concurrent Testbed using Reusable Modules (20).

• Secondary storage input/output (I/O) during parallel simulations on the Intel
hypercubes has been shown to overwhelm the benefits of parallelization (4:80).
This thesis will focus on achieving computational speedup only. It is assumed that
other research will effectively address the architectural issues associated with the
large I/O requirements of PDES applications.

• Under the SPECTRUM simulation environment, individual VHDL behaviors are
grouped into logical processes (LPs) to increase the granularity of the application
tasks. The research in this thesis makes the assumption of one LP per physical
processor. This assumption eliminates the context switching and message passing
overhead encountered when multiplexing several LPs on a single processor.

•  A  graph-based  behavior  dependency  representation  will  provide  the  information 
necessary to make sound partitioning decisions in an efficient manner.

1.5 Scope

The following list outlines the limits on the scope of this research effort:

• Finding an optimal solution to the problem of mapping N inter-dependent tasks
onto P processors is known to be NP-Complete (22:142). This research will seek
an efficient and effective heuristic approach that results in consistently good
solutions, though they may be sub-optimal.

• The subset of the standard VHDL supported by VSIM will not be extended as part
of  this  research  effort. 

• Circuit descriptions used to validate various partitioning strategies will be limited
to fewer than 5000 behaviors. To implement circuits much larger than this limit in a
realistic manner will require extensions to the VHDL subset supported by VSIM.

• This research will not alter the conservative null-message parallel discrete event
simulation (PDES) protocol currently implemented by VSIM except when such
alterations directly support the validation of a partitioning strategy.

1.6  Limitations 

The  limitations  of  the  VSIM  parallel  VHDL  simulator  are  described  in  (4:4-6).  No 
new  limitations  on  VSIM  are  imposed  as  a  result  of  this  thesis  effort.  However,  the 
partitioning  tool  implemented  as  part  of  this  thesis  has  been  limited  to  a  maximum  of  128 
LPs  due  to  the  memory  required  for  the  data  structures  used. 

1.7  Thesis  Overview 

Chapter 2 reviews several general approaches to solving the problem of efficiently
mapping N tasks onto P processors as found in the current literature. Chapter 3 gives the
background on the implementation of VSIM and the SPECTRUM simulation
environment. This information leads into a discussion of the specific requirements for a
parallel VHDL partitioning strategy. Implementation of this strategy is discussed in
Chapter 4. Chapter 5 discusses the research methodology and results. Finally, Chapter 6
presents the conclusions formulated during this research and gives recommendations for
future research.

1.8  Summary 

The need for this research stems from the rapid increase in the size and complexity of
modern large-scale integrated circuit designs. Current commercial VHDL simulators
execute in a sequential manner, leading to long design cycles for extremely large circuits.
One approach to achieving the desired speedup is through distribution of the simulation
load among multiple processors in a parallel system. Previous AFIT research has
validated the concept of parallel VHDL simulation through the development of VSIM.
This research investigates methods of partitioning the VHDL circuits among the parallel
processors in order to maximize the speedup obtained through parallelization.

II. Background


2.1  Overview 

This chapter presents a discussion of previous research relating to the parallel
program Mapping Problem and how those results might be applied to the specific
problem of partitioning structural VHDL circuit descriptions for parallel simulation. To
facilitate this discussion, several characteristics of an ideal partitioning strategy are first
presented.

2.2 The VHDL Mapping Problem

2.2.1 The Parallel Programming Mapping Problem. In the context of parallel
programming applications, the mapping problem is defined as the binding of the logical
components of the parallel application program to the physical resources of the target
parallel system such that some desired performance criterion is optimized (22:141). For
example, it is usually desired to map the application in such a way that the total execution
time is minimized. Finding optimal solutions to the general mapping problem has been
shown to be NP-complete, and no polynomial time algorithm for their solution is known
to exist (17:63, 22:142). As a result, sub-optimal solutions are often pursued using various
heuristic methods (22, 17). The logical-to-physical binding of a parallel application
controls the utilization of the parallel system and directly affects the amount of time and
memory required to complete program execution.

The mapping problem arises when the number of processes (i.e., tasks) required by the
parallel application is greater than the number of available processors (cardinality
variation), or when the task-dependency structure of the parallel application differs from
the physical interconnection structure of the parallel system (topological variation) (2).

2.2.2 Partitioning VSIM. To date, no effort has been made at developing an
optimal (or near-optimal) partitioning strategy for mapping VHDL behaviors to fully
exploit the parallelism available in large VHDL simulations using the VSIM simulator
(4). In the parallel VHDL simulation environment created by VSIM, cardinality variation
can be dealt with by grouping VHDL behaviors into logical processes (LPs) which
comprise the concurrent tasks managed by the SPECTRUM testbed. A large VHDL
circuit may contain hundreds of thousands of behaviors with a complex interdependency
structure. Assuming the assignment of 1 LP per physical processor, there are likely to be
hundreds, or even thousands, of behaviors per LP. As a result, the behavior grouping, or
partitioning, is likely to be a critical factor in the relative performance of the parallel
simulation.

The  two  key  objectives  of  most  strategies  that  have  been  proposed  for  the  general 
parallel  program  mapping  problem  are  achieving  a  balanced  computation  load  among  all 
of  the  processors,  and  minimizing  the  inter-processor  communication.  The  former  deals 
with  making  efficient  use  of  all  of  the  processor  resources,  while  the  latter  deals  with 
reducing non-productive overheads such as message setup and transfer times.
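
These two objectives can be made concrete with a small sketch. The Python fragment below is illustrative only and is not part of VSIM or SPECTRUM; the six-task graph, the two assignments, and the function names are assumptions chosen for the example. It measures load balance as the number of tasks per processor and communication as the number of dependency edges that cross processor boundaries.

```python
from collections import Counter


def load_per_processor(assignment, num_processors):
    """Count how many tasks are assigned to each processor."""
    counts = Counter(assignment.values())
    return [counts.get(p, 0) for p in range(num_processors)]


def cut_edges(edges, assignment):
    """Count dependency edges whose endpoints sit on different processors."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Hypothetical 6-task dependency graph: a chain with one extra edge.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 2)]

# Two assumed mappings onto 2 processors, both perfectly load balanced.
contiguous = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
interleaved = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}

print(load_per_processor(contiguous, 2), cut_edges(edges, contiguous))
print(load_per_processor(interleaved, 2), cut_edges(edges, interleaved))
```

Both mappings place three tasks on each processor, yet the contiguous mapping cuts one edge while the interleaved mapping cuts five. Load balance alone therefore says little about communication cost.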

The general mapping problem can be divided into two sub-problems: job scheduling
and task allocation (21:1408). The goal of job scheduling is to obtain maximum system
utilization by scheduling independent jobs among the processors in a distributed system.
This involves a dynamic scheduling ability as old jobs are completed and new ones are
submitted. In contrast, the task allocation problem involves the allocation of several
inter-dependent tasks of a single program among the processors in a distributed or parallel
system. The goal of task allocation is to minimize the completion time of the single
application program. The task allocation problem has been approached separately as both
a dynamic and static allocation problem, with the latter being desirable if the
interdependencies of the task structure can be statically defined a priori (21:1409). In the
VSIM environment, each individual VHDL behavior is equivalent to a task.^ In addition,
the inter-dependency structure of the behaviors is known prior to simulation and is static.
Therefore, throughout this thesis, discussion of the mapping problem implies the static
task allocation problem.

2.3  Characteristics  of  an  Ideal  Partitioning  Strategy 

Before evaluating various heuristic solutions to the general parallel program mapping
problem,  it  is  useful  to  discuss  some  characteristics  of  an  ideal  partitioning  strategy  that 
can  be  used  for  comparison  purposes.  In  the  context  of  this  thesis,  the  phrase  ideal 
partitioning  strategy  is  used  as  it  applies  to  the  specific  problem  of  parallel  discrete  event 
simulation  (PDES)  for  large  VHDL  circuits  using  VSIM.  It  is  reasonable  to  expect  that 
such  a  partitioning  strategy  will  be  equally  applicable  to  other  parallel  problems  that  have 
similar  static  task  dependency  characteristics.  The  desirable  properties  of  an  ideal  parallel 
VHDL  partitioning  strategy  include  the  following: 

•  Computational  Efficiency  -  The  partitioning  algorithm  should  be 
computationally  efficient,  requiring  only  polynomial  time  to  converge  to  a  good 
solution.  Finding  the  optimal  solution^  to  the  general  mapping  problem  has  been 
shown  to  be  NP-complete,  thus  rendering  it  computationally  infeasible  to  seek 
such  a  solution  for  large  and  complex  problems  (17:63,  22:142).  Parallel 
algorithms  are  one  potential  means  of  achieving  the  necessary  computational 
efficiency,  although  numerous  sequential  algorithms  have  been  proposed  as  well. 


^  Throughout  this  thesis,  the  terms  behavior,  process,  and  task  are  used  interchangeably  to  represent  the 
vertices  of  the  problem-graph. 

^  In  the  context  of  this  thesis,  the  optimal  solution  is  defined  as  the  mapping  that  results  in  the  fastest 
simulation for a given number of processors. A good solution is defined as any solution that results in
“near-optimal” simulation run times.

• Balanced Workload - The partitioning algorithm should result in a balanced
computation load among all available processors (5:294). This requires that the
percentage of time spent performing useful computations be approximately equal
for each processor.

•  Exploitation  of  Inherent  Parallelism  -  The  partitioning  algorithm  should 
produce  solutions  which  take  advantage  of  the  parallelism  inherent  in  the 
simulation  application. 

•  Minimized  Inter-Processor  Communications  -  The  partitioning  algorithm 
should  produce  solutions  with  minimal  communications  between  processors 
(5:294).  The  two  primary  factors  to  consider  here  are  the  number  of 
communication  links  between  tasks  on  different  processors,  and  the  relative 
frequency  with  which  messages  are  sent  over  those  links. 

•  Scalability  -  The  partitioning  algorithm  should  be  easily  scalable,  both  in 
terms  of  the  number  of  tasks  in  the  problem  graph,  and  the  number  of  processors 
(5). 

•  Deterministic  Solutions  -  The  partitioning  algorithm  should  produce 
deterministic  solutions  which  are  based  upon  the  known  static  inter-dependency 
structure  of  the  problem  graph. 

• Input Problem Variations - The partitioning algorithm should be applicable to
a  wide  variety  of  VHDL  circuits,  including  those  with  feedback  loops. 

• Accounts for PDES Synchronization Protocol - Ideally, the partitioning
algorithm should be equally applicable regardless of the particular parallel
discrete event simulation protocol used. However, this is not feasible since the
simulation protocols play a major role in defining the amount of inter-processor
communications. Instead, given a specific simulation synchronization protocol,
the partitioning algorithm should account for its overhead requirements when
making decisions regarding task partitioning.

2.4  General  Approaches  to  the  Mapping  Problem 

Numerous approaches to the static task mapping problem have been pursued
including algorithms based on graph theory, mathematical programming, queueing
theory, and various heuristic approaches such as simulated annealing (21:1409). Two
specific graph-based models which have been proposed for modeling the static task
allocation problem involve the use of a task precedence graph (TPG) and a task
interaction graph (TIG). The task precedence graph model consists of a directed graph in
which the vertices represent tasks and the edges represent inter-task execution
dependencies. Computational and communication costs are represented by adding
weights to the vertices and edges of the graph. A task interaction graph has the same basic
structure. However, a task precedence graph models execution precedence dependencies
whereas a task interaction graph models the need for inter-task communication without
explicitly representing such temporal dependencies. All tasks are considered
independently and concurrently executable. In both models, the goal is to map the tasks to
processors so as to minimize the total program execution time (21:1409).
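
As a rough illustration of the structure both models share, the sketch below stores a weighted directed graph in Python. It is an assumed representation for discussion purposes, not the data structure used by VSIM; the class name, method names, and the small gate-level example are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TaskGraph:
    """Weighted directed task graph (illustrative, not VSIM's structure)."""
    vertex_weight: dict = field(default_factory=dict)  # task -> computation cost
    edge_weight: dict = field(default_factory=dict)    # (src, dst) -> communication cost

    def add_task(self, task, cost=1):
        self.vertex_weight[task] = cost

    def add_dependency(self, src, dst, cost=1):
        self.edge_weight[(src, dst)] = cost

    def predecessors(self, task):
        """Tasks that must produce output before `task` can execute."""
        return sorted(s for (s, d) in self.edge_weight if d == task)

    def successors(self, task):
        """Tasks that consume the output of `task`."""
        return sorted(d for (s, d) in self.edge_weight if s == task)


# Hypothetical circuit: two source signals feed an AND gate, which drives an output.
g = TaskGraph()
for name in ("in_a", "in_b", "and1", "out"):
    g.add_task(name)
g.add_dependency("in_a", "and1")
g.add_dependency("in_b", "and1")
g.add_dependency("and1", "out")

print(g.predecessors("and1"), g.successors("and1"))
```

Read as a TPG, the edge direction carries the precedence relation. Read as a TIG, the same edges would indicate only that the two tasks exchange data, and the direction would be ignored.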

The  class  of  parallel  problems  that  can  be  modeled  by  a  task  interaction  graph  (TIG) 
consists  primarily  of  iterative  algorithms  in  which  all  tasks  can  execute  independently 
during  each  iteration,  and  exchange  data  values  only  in  between  iterations  (21:1409). 
Many  algorithms  for  matrix  manipulations  fit  into  this  category. 

On the other hand, problems which exhibit temporal dependencies (e.g., task B cannot
execute until task A has executed) between tasks can be modeled with the task
precedence graph (TPG). The temporal dependencies modeled by the directed arcs of the
TPG define a series of tasks that must be executed sequentially, making these problems
inherently difficult to parallelize.

Figure 2. Iterative PDES Algorithm

The set of iterative problems encompassing parallel discrete event simulation (PDES)
can be represented by the TPG model by considering the temporal dependencies in terms
of a single iteration, and overlapping the iterations in a pipeline fashion. Figure 2 shows
an example for three simple tasks over five iterations, with the assumption that each task
is on a separate processor. Task A produces a series of five data values, each of which is
acted on separately by task B. In turn, task B produces a series of five data values, each of
which is acted on separately by task C. Because task A has no dependencies, it can run
independently to completion. However, task B cannot begin its operation on the first
iteration until the first input has been received from task A. A similar relationship holds
between tasks B and C. For illustrative purposes, each task requires a different amount of
time to perform its computation on the data values flowing through the system as
indicated in the figure. The numbers in the rectangles represent the iteration that each task
is on at a given point in time, with time not covered by a rectangle representing idle time
for that task. Note that because this is an event-driven simulation, the three tasks are not
necessarily executing in lockstep. For example, task A begins its third iteration before
task B completes its first iteration.
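
The pipelined overlap described above can be sketched with a small timing model. The per-iteration compute times used below (1, 3, and 2 time units for tasks A, B, and C) are assumptions made for illustration and are not taken from Figure 2. The model also ignores message latency and synchronization overhead.

```python
def pipeline_schedule(durations, iterations):
    """Return finish[t][i]: the time at which task t completes iteration i.

    Tasks form a chain (task 0 feeds task 1, and so on), one task per
    processor. A task starts iteration i once it has finished iteration
    i-1 and its predecessor has finished iteration i.
    """
    finish = [[0] * iterations for _ in durations]
    for t, d in enumerate(durations):
        for i in range(iterations):
            own_ready = finish[t][i - 1] if i > 0 else 0
            input_ready = finish[t - 1][i] if t > 0 else 0
            finish[t][i] = max(own_ready, input_ready) + d
    return finish


# Assumed compute times for tasks A, B, C (not the values shown in Figure 2).
finish = pipeline_schedule([1, 3, 2], iterations=5)
for name, row in zip("ABC", finish):
    print(name, row)
```

Under these assumed times, task A starts its third iteration at time 2, before task B completes its first iteration at time 4. The final iteration of task C completes at time 18, compared with 30 time units if all fifteen computations ran back to back on a single processor.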

The TPG model is used throughout this thesis to model the VHDL mapping problem^
by modeling the individual VHDL behaviors as graph vertices and the inter-behavior
dependencies as directed edges. Further discussion of this model can be found in the
requirements section of chapter 3. The remainder of this chapter examines numerous
aspects of various partitioning schemes using graph-theoretic techniques that have been
proposed for a variety of parallel problems.

2.4.1 Random Partitioning. Random Partitioning involves the random
distribution of the tasks into the desired number of LPs, and was the partitioning scheme
used for prior AFIT research on circuits with more than 100 behaviors (4). It is one of the
simplest partitioning algorithms, but potentially the most ineffective. Under this
approach, only the load balancing among the LPs is considered. The behavior
dependencies, and associated communications costs, are ignored. Breeden shows that in
some limited circumstances, speedup can be obtained with this partitioning scheme
(4:70). However, because the behavior dependencies are not considered, the resulting
partition has a large number of inter-behavior dependencies that cross the partition
boundaries, and often has artificial feedback imposed upon it. The situation worsens as
the number of LPs grows. As a result, this partitioning strategy is not likely to scale very
well or provide very good performance. This conclusion is supported by the limited data
available from previous research (4:70).
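
A minimal sketch of such a scheme is shown below. It is an assumed illustration rather than the code used in the earlier AFIT work: the function name and the shuffle-then-deal approach are hypothetical, chosen only to show that the assignment balances the number of behaviors per LP while ignoring the dependency structure entirely.

```python
import random
from collections import Counter


def random_partition(behaviors, num_lps, seed=None):
    """Shuffle the behaviors, then deal them round-robin to the LPs.

    LP sizes differ by at most one. Dependencies are never consulted.
    """
    rng = random.Random(seed)
    shuffled = list(behaviors)
    rng.shuffle(shuffled)
    return {b: i % num_lps for i, b in enumerate(shuffled)}


# Ten hypothetical behaviors spread over four LPs.
assignment = random_partition(range(10), num_lps=4, seed=1)
sizes = Counter(assignment.values())
print(sorted(sizes.values()))
```

Because no edge information enters the computation, the fraction of dependencies that cross LP boundaries can be expected to grow as the number of LPs increases.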

^ The VHDL mapping problem in this thesis is considered relative to the VSIM/SPECTRUM simulation
environment.

2.4.2 Single Data Partitioning. A slightly better algorithm, but just as simple as
the random partitioning, is referred to as Single Data Partitioning (SDP) (9:78). Under
the SDP approach, each vertex in the graph has a weight associated with it, with the
weight calculated as the degree of the vertex (number of arcs to and/or from the vertex).
Each partition is filled one at a time, with vertices selected for inclusion by decreasing
order of their weight until the combined weight of the current partition is approximately
equal to the calculated average weight (9:78). The result is that vertices with a high in/out
degree will tend to be grouped together in partitions with fewer vertices. The method of
selecting among several vertices with the same weight is not specified, and is assumed
to be arbitrary. As a result, it is reasonable to expect that for graphs in which a large
portion of the vertices have equal weights, results similar to those for random partitioning
would be achieved.

Since  each  arc  in  the  graph  represents  a  potential  for  inter-task  communications,  a 
vertex  with  a  high  in/out  degree  is  likely  to  have  more  inter-task  communications  than  a 
vertex  with  a  small  in/out  degree.  Thus,  on  the  surface  it  seems  as  though  grouping 
vertices  with  a  potential  for  large  amounts  of  communications  in  the  same  partition  would 
tend  to  minimize  inter-partition  communications.  The  fallacy  of  this  approach  is  that  a 
group  of  vertices  with  high  in/out  degree  may  in  fact  result  in  a  large  amount  of 
communications,  but  not  necessarily  with  other  vertices  in  the  same  partition. 
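
The SDP procedure and its weakness can be illustrated with a short sketch. The code below is one possible reading of the description above, not the implementation from (9): ties are broken by vertex label, a partition is considered full once its combined weight reaches the average, and the seven-vertex graph is hypothetical.

```python
def sdp_partition(edges, num_partitions):
    """Fill partitions one at a time with vertices in decreasing degree order."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Tie-break by vertex label (an assumption; the source leaves this open).
    order = sorted(degree, key=lambda x: (-degree[x], x))
    target = sum(degree.values()) / num_partitions  # average weight per partition
    partitions = [[] for _ in range(num_partitions)]
    current, filled = 0, 0
    for v in order:
        # Move on once the current partition has reached the average weight.
        if filled >= target and current < num_partitions - 1:
            current += 1
            filled = 0
        partitions[current].append(v)
        filled += degree[v]
    return partitions


# Hypothetical graph: vertex 0 is a hub; vertices 1, 5, 6 form a short chain.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (5, 6)]
parts = sdp_partition(edges, 2)
where = {v: i for i, part in enumerate(parts) for v in part}
crossing = sum(1 for u, v in edges if where[u] != where[v])
print(parts, crossing)
```

In this example the two highest-degree vertices share a partition, yet four of the six edges still cross the partition boundary, because most of the neighbors of the hub were placed elsewhere.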

2.4.3 General Graph Contraction & Layout Algorithm. An alternative approach to
the mapping problem attempts to address both cardinality variation and topological
variation at the same time by modeling a given parallel algorithm with a family of graphs
(Gn). Each graph Gn represents the static dependency graph of a parallel algorithm for a
problem of size n. A similar graph Gp models the physical processors and interconnection
structure of the target architecture for P processors (2:307, 3:441). Given a problem of
size n represented by graph Gn, the proposed approach involves contracting the graph Gn
into a smaller graph Gk of size k from the same graph family. The contraction process is
continued until k = P, thereby eliminating the cardinality variation. The next step is to lay
out the contracted graph Gk onto the physical interconnection graph Gp, thereby
eliminating the topological variation. The final step in the algorithm uses multiplexing to
implement the problem-graph Gn on the image of Gk (2:307, 3:441). Three examples of
contracting an algorithm represented by Gn into a graph from the same family Gk are
shown in Figure 3.

Figure 3. Contraction within a graph family. Vertices incident to dashed edges are
grouped into a single vertex (3:449)

In  the  specific  case  of  the  VSIM/SPECTRUM  simulation  environment,  this  approach 
could  be  improved  by  encapsulating  each  vertex  on  the  contracted  graph  Gk  inside  of  a 
single  logical  process  (LP),  thereby  avoiding  the  communications  and  context  switching 
overheads  associated  with  multiplexing  multiple  tasks  on  a  single  processor.  Even  then, 
this approach makes two implicit assumptions about the problem domain that severely
limit its feasibility as a solution to the mapping problem for parallel VHDL simulation.


Figure 4. Alternative Contraction for Graph of Figure 3.a

First, it assumes that the dependency graph of the parallel algorithm (in this case the
simulation of a structural VHDL circuit) will exhibit a pattern that remains consistent as
the problem size grows larger. Second, it assumes that there exists a graph Gk ∈ (Gn)
whose cardinality and topological layout are the same as that of the physical
interconnection graph Gp.

As a final observation concerning this approach, it should be noted that even if both
of these assumptions hold and this methodology can be used to map the problem-graph
onto the processor-graph, it may or may not result in the most efficient inter-processor
communications structure. If tasks on the same processor are encapsulated within a single
LP, external messages will not be required for two such tasks to communicate.
Therefore, each dashed edge in Figure 3 represents a potential inter-processor
communications link that has been eliminated in the contracted graph. Considering the
8-node graph of Figure 3.c, which is contracted into 4 LPs, there is clearly no possible way
that this contraction could be done such that more than four edges are eliminated without
sacrificing load balancing. However, consider the 11-node graph of Figure 3.a, which is
contracted into 5 LPs with the elimination of only two edges. Figure 4 shows an
alternative contraction that would eliminate six edges, thus reducing the potential
inter-processor communications.
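
The comparison between alternative contractions can be made mechanical by counting how
many problem-graph edges fall inside a single LP. The following is a minimal sketch of
that count. The function name, the 8-vertex ring, and the pairing of adjacent vertices
are hypothetical illustrations and do not reproduce the graphs of Figure 3 or Figure 4.

```python
def contraction_edge_counts(edges, group_of):
    """Return (eliminated, remaining) edge counts for a contraction.

    edges:    list of (u, v) pairs from the problem-graph
    group_of: dict mapping each vertex to the LP (group) it is contracted into
    An edge is eliminated when both endpoints are contracted into the same LP.
    """
    eliminated = sum(1 for u, v in edges if group_of[u] == group_of[v])
    return eliminated, len(edges) - eliminated


# Hypothetical example: an 8-vertex ring contracted into 4 LPs of two
# adjacent vertices each (vertices 0-1, 2-3, 4-5, 6-7).
ring = [(i, (i + 1) % 8) for i in range(8)]
pairs = {v: v // 2 for v in range(8)}
print(contraction_edge_counts(ring, pairs))  # (4, 4)
```

Evaluating two candidate contractions of the same graph with such a count gives a
direct, if crude, measure of which one removes more potential inter-processor links.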

2.4.4 Strip Assignment Algorithm. Two other approaches, Strip Assignment and
Two-Dimensional Mapping, have been proposed for the specific problem of mapping
metalforming applications using finite element methods onto a hypercube architecture
(22, 21). Both of these heuristic approaches attempt to address the load-balancing and
inter-processor communications aspects of the mapping problem separately.
Load-balancing is addressed by attempting to allocate an equal number of tasks to each
processor. This is analogous to allocating an equal number of VHDL behaviors to each
Logical Process (LP), with one LP per processor. The inter-processor communications
overhead is addressed in two ways. First, the algorithms attempt to group tasks together
on the same processor such that the amount of inter-processor communications required
is minimal. Second, each algorithm attempts to allocate the task groups among the
processors such that only nearest-neighbor communications are ever required (22, 21).

The strip assignment method evenly distributes the tasks among the available
processors in such a way that each processor will need to communicate with no more
than two immediately adjacent neighbor processors. Letting N be the total number of
tasks in the problem-graph and P be the total number of processors, the number of tasks
per processor to achieve a balanced load is ⌈N/P⌉ for some processors, and ⌊N/P⌋ for the
remainder (22:144, 21:1414). An example for N = 48, P = 6 is shown in Figure 5. Letting
Nc be the maximum number of tasks in any column of the problem-mesh, and Nr be the
maximum number of tasks in any row, the order in which tasks are added to a partition
depends on the relative magnitudes of Nc and Nr. Beginning at any corner of the
problem-mesh, tasks are added to the current partition along the columns if Nc ≤ Nr, or
along the rows if Nr ≤ Nc. Subsequent partitions begin where the previous one left off
(22:145).
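
The strip procedure just described can be sketched in a few lines for the simplified
case of a perfectly regular mesh. This is an illustration under stated assumptions, not
the cited authors' implementation: the function names are hypothetical, the mesh is
assumed rectangular (so Nc equals the number of rows and Nr the number of columns), and
the traversal always starts at the corner (0, 0).

```python
def strip_sizes(n_tasks, n_procs):
    """Balanced partition sizes: ceil(N/P) for (N mod P) processors, floor(N/P) for the rest."""
    base, extra = divmod(n_tasks, n_procs)
    return [base + 1] * extra + [base] * (n_procs - extra)


def strip_assign(n_rows, n_cols, n_procs):
    """Assign the tasks of a regular n_rows x n_cols mesh to processors in strips.

    Tasks are visited column by column when Nc <= Nr (here n_rows <= n_cols),
    otherwise row by row. Each partition starts where the previous one stopped.
    Returns a dict {(row, col): processor}.
    """
    if n_rows <= n_cols:
        order = [(r, c) for c in range(n_cols) for r in range(n_rows)]
    else:
        order = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    assignment, start = {}, 0
    for proc, size in enumerate(strip_sizes(len(order), n_procs)):
        for task in order[start:start + size]:
            assignment[task] = proc
        start += size
    return assignment


# Hypothetical regular mesh with N = 48 tasks (6 rows x 8 columns) and P = 6.
print(strip_sizes(48, 6))            # [8, 8, 8, 8, 8, 8]
print(strip_sizes(50, 6))            # [9, 9, 8, 8, 8, 8]
print(strip_assign(6, 8, 6)[(2, 1)])  # 1
```

For an irregular problem-mesh such as the one in Figure 5, the traversal order would
have to skip missing mesh positions, but the sizing rule is unchanged.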

Figure 5. Example of the Strip Assignment Method for a Problem-Mesh (22:143)

The strip assignment method assumes that the graphical representation of the problem
can be represented by a two-dimensional mesh layout in which each task has at most four
communication paths to its nearest neighbor tasks. In addition, in order for the strip
method to guarantee that nearest-neighbor communications are maintained among the
physical processors, a certain constraint on the problem-mesh must be met (22:144,
21:1414). Specifically, the strip method requires that:

\[ \lfloor N/P \rfloor \ge \min(N_c, N_r) \]

It is reasonable to expect that very few, if any, VHDL circuit dependency graphs will
ever meet the two-dimensional mesh layout requirement. Thus, this algorithm does not
seem suitable for the partitioning of structural VHDL circuit simulations.

2.4.5 Two-Dimensional Mapping Algorithm. The other method that has been
applied to metalforming applications using finite element methods is Two-Dimensional
Mapping (22:145). This approach differs from the Strip-Assignment method in that an
attempt is made to reduce the number of nearest-neighbor communication links between
processors. As can be seen from Figure 5, the Strip-Assignment method results in
partitions that span the entire problem-mesh (either column-wise or row-wise). As a
result, a large number of communications links in the problem-mesh cross the partition
boundaries. The Two-Dimensional Mapping approach capitalizes on the fact that square
partitions will have a smaller perimeter-to-area ratio than the “rectangular” partitions of
the Strip-Assignment method, and thus should result in fewer communications links
crossing the partition boundaries. Schwan et al. suggest a three-step approach (22:145):

• Divide the problem-mesh into square partitions as if the problem-mesh were
perfectly regular. Depending on the dimensions of the problem-mesh, some
partitions may only approximate squares.

• Use a border-refinement algorithm to account for an irregular problem-mesh and
achieve load-balancing.

• Use a secondary refinement algorithm to attempt further minimization of
inter-processor communications while maintaining a balanced load. This step is
optional.

Figure 6. Initial Partition of a Two-Dimensional Mapping for a Problem-Mesh

Figures 6 and 7 demonstrate a two-dimensional mapping for the same problem-mesh
as in Figure 5. First, Figure 6 shows the initial two-dimensional partition consisting of
simple horizontal and vertical lines through the problem-mesh. Visualizing the
problem-mesh as being perfectly regular (all columns have Nc processes and all rows have Nr
processes), the lines are placed so that the resulting partitions are as close to being
identically sized squares as possible. However, because the problem-mesh is not perfectly
regular, the resulting partitions are not evenly balanced. The second step in the algorithm
requires moving processes from one partition to a neighboring partition until all partitions
are balanced. The resulting partitions are shown in Figure 7. Note that in the
strip-assignment partitions of Figure 5, there were 32 inter-partition communication links. In
the two-dimensional partition, the number of inter-partition communication links has
been reduced to 21 while maintaining a balanced load.

Figure 7. Two-Dimensional Mapping Partitions after Border-Refinement
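
The perimeter-to-area argument can be checked numerically on a small regular mesh. The
sketch below counts the mesh links that cross partition boundaries for a strip layout
and for a square-block layout. The 8 x 8 mesh, the four processors, and the function
names are hypothetical; the counts it prints are for this toy mesh only and are not
the 32 and 21 links of Figures 5 and 7.

```python
def cut_links(n_rows, n_cols, owner):
    """Count mesh links whose two endpoints belong to different partitions.

    owner(r, c) returns the partition that holds the task at row r, column c.
    Each task is linked to its right-hand and lower neighbor, if present.
    """
    cut = 0
    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols and owner(r, c) != owner(r, c + 1):
                cut += 1
            if r + 1 < n_rows and owner(r, c) != owner(r + 1, c):
                cut += 1
    return cut


def strip_owner(r, c):
    # Four vertical strips, each two columns wide (16 tasks per partition).
    return c // 2


def block_owner(r, c):
    # Four 4 x 4 square blocks (16 tasks per partition).
    return (r // 4) * 2 + (c // 4)


print(cut_links(8, 8, strip_owner))  # 24
print(cut_links(8, 8, block_owner))  # 16
```

Both layouts are perfectly balanced, yet the square blocks cut one third fewer links,
which is the effect the two-dimensional mapping seeks to exploit.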

An  additional  benefit  of  two-dimensional  mapping  over  strip-assignment  is  that  the 
former  does  not  impose  any  constraints  on  the  number  of  processors  P  as  it  relates  to  the 
dimensions of the problem-mesh Nr and Nc. Thus, it can be applied to problem-meshes
where  strip-assignment  is  not  applicable.  Like  strip-assignment,  however,  this  approach  is 
only  applicable  to  problems  that  can  be  represented  as  a  two-dimensional  mesh.  Thus,  as 
it  is  presented  here,  two-dimensional  mapping  does  not  seem  suitable  for  the  partitioning 
of  structural  VHDL  circuit  simulations.  However,  the  idea  of  performing  load  balancing 
and  reducing  the  inter-partition  communication  costs  by  refining  the  partition  boundaries 
can  be  extended  to  algorithms  that  can  be  applied  to  other  forms  of  problem  graphs. 

2.4.6 Algorithm M, An Optimal Approach. Another approach to the mapping
problem, Algorithm M, has been proposed for static assignment of tasks in a distributed
system in which the processors communicate over an Ethernet-based medium (16). In
such a system, all processors communicate over a shared pathway for which all
processors must compete. As a result, the processor interconnection graph Gp is fully
connected, and no special steps are required to lay out the partitioned problem-graph Gn
onto the processor graph Gp (16:240).

Lo argues that the goals of a static partitioning algorithm for a distributed system are
different  from  those  of  a  static  partitioning  algorithm  for  a  parallel  system  given  an 
identical  problem  domain  (16:240).  However,  in  both  systems,  the  inter-processor 
communications  (IPC)  should  be  minimized  while  meeting  some  load  balancing 
constraint  (e.g.  equal  number  of  processes  per  processor).  The  two  systems  differ 
primarily  in  the  relative  cost  of  inter-processor  communications  in  relationship  to  the  cost 
of  some  load  imbalance.  However,  these  relative  cost  differences  also  exist  between 
different  classes  of  parallel  systems,  and  even  between  different  applications  on  the  same 
parallel  system.  Thus,  the  goals  of  partitioning  problem-graphs  for  distributed  systems 
and  partitioning  them  for  parallel  systems  are  actually  the  same.  It  is  only  the  mapping  of 
the partitions onto the physical processors that differs, and then only if one is concerned
about maintaining nearest-neighbor communications in parallel systems. As a result, an
effective partition for one type of system is likely to be just as effective on the other type
of system. Given this, Algorithm M can be studied as it might apply to partitioning
problem-graphs for parallel systems.

Algorithm M has been shown to provide an optimal partitioning of a problem-graph
Gn onto P identical processors in polynomial time, provided that the number of tasks n in
Gn is no more than twice the number of processors (n ≤ 2P), and provided that no more
than two tasks can be assigned to any single processor (16:241). Although this restriction
makes this approach unsuitable for parallel VHDL simulations, which may have hundreds
to tens of thousands of tasks per processor, it is discussed here because it leads into the
sub-optimal heuristic approximation discussed in the next section, which removes the
restriction on n.

Algorithm M begins by finding a maximum weight matching of the problem-graph
Gn. A matching is defined as a set of edges from a graph such that no two edges in the
set share a common vertex. The sum of the weights of the edges in the matching forms
the weight of the matching. The maximum weight matching is the matching of the graph
with the largest weight. Algorithms exist to find the maximum weight matching of a graph
in polynomial time (16:241).

After the maximum weight matching has been found, the next step is assigning the two
tasks corresponding to each edge in the matching to a processor with no other tasks
assigned to it (16:241). It should be noted that a maximum weight matching may not
necessarily contain all of the tasks in the problem-graph. Tasks that are not connected by
an edge in the matching are arbitrarily paired and assigned to a processor with no other
tasks assigned to it. Finally, if there remains a single unpaired task, it is assigned by itself
to any remaining processor that has no other tasks assigned to it. Lo states that this
algorithm will provide an optimal partition for a given input problem-graph that meets the
constraint n ≤ 2P (16:241). If the problem is such that communications costs between
some pairs of tasks will be greater than between other pairs of tasks (due to frequency of
messages, size of messages, etc.), this algorithm has the interesting property that those
pairs of tasks with the highest communications costs will be assigned to the same
processor, where communications costs are negligible.
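
The assignment steps above can be sketched for very small graphs as follows. This is an
illustration only: the matching routine shown is an exhaustive search that runs in
exponential time, used here as a stand-in for the polynomial-time matching algorithms
referred to in the text, and all function names and the example weights are
hypothetical.

```python
def max_weight_matching(vertices, weight):
    """Exhaustively find a maximum weight matching (exponential time).

    vertices: list of task ids
    weight:   dict {(u, v): cost} with u < v; pairs not listed are not edges
    Returns (total_weight, list_of_matched_pairs).
    """
    if len(vertices) < 2:
        return 0, []
    first, rest = vertices[0], vertices[1:]
    # Option 1: leave `first` unmatched.
    best = max_weight_matching(rest, weight)
    # Option 2: match `first` with each of its neighbors in turn.
    for i, other in enumerate(rest):
        key = (min(first, other), max(first, other))
        if key in weight:
            sub_w, sub_m = max_weight_matching(rest[:i] + rest[i + 1:], weight)
            if sub_w + weight[key] > best[0]:
                best = (sub_w + weight[key], [key] + sub_m)
    return best


def algorithm_m(tasks, weight, n_procs):
    """Assign at most two tasks per processor, following the steps in the text."""
    if len(tasks) > 2 * n_procs:
        raise ValueError("requires n <= 2P")
    _, matching = max_weight_matching(sorted(tasks), weight)
    groups = [list(pair) for pair in matching]
    matched = {t for pair in matching for t in pair}
    leftover = [t for t in sorted(tasks) if t not in matched]
    # Arbitrarily pair the unmatched tasks; a final odd task stays alone.
    groups += [leftover[i:i + 2] for i in range(0, len(leftover), 2)]
    return {proc: group for proc, group in enumerate(groups)}


# Hypothetical 5-task problem-graph on 3 processors.
costs = {(0, 1): 5, (1, 2): 9, (2, 3): 4, (3, 4): 7, (0, 4): 1}
print(algorithm_m(list(range(5)), costs, 3))  # {0: [1, 2], 1: [3, 4], 2: [0]}
```

In this example the two heaviest non-adjacent edges, (1, 2) and (3, 4), are kept inside
processors, and the remaining task 0 is placed alone on the third processor.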

As mentioned previously, the restriction n ≤ 2P limits the class of problem-graphs that
may be partitioned using this algorithm. An additional shortfall of the algorithm is the
implicit assumption that each task will have an equal weight (in terms of computational
cost). Depending on the extent to which this may not be true, this further limits the class
of problems for which this algorithm will result in a partition that is truly optimal in terms
of both inter-processor communications and load balancing.

2.4.7 Algorithm H, A Heuristic Approximation of Algorithm M. Algorithm H
represents a sub-optimal approximation of Algorithm M for the polynomial-time
partitioning of an arbitrary number of tasks n among P processors with a bound B on the
maximum number of tasks per processor, where ⌈n/P⌉ ≤ B ≤ n (16:242). This is
accomplished by first reducing the original problem-graph Gn with n tasks to a smaller
graph Gk with k nodes using a “Sort Greedy Algorithm” so that k ≤ 2P and with Gk
having no more than ⌈B/2⌉ tasks per node (16:243). An optimal partitioning for the
reduced graph Gk can then be obtained using Algorithm M. However, this partition may
not necessarily be optimal for the original problem-graph Gn (16:242).

Lo’s simulation results have shown that Algorithm H performs relatively well, finding
an optimal partition in over 80% of the test cases run (16:243). However, this data was
collected on only a small set of problem-graphs with no more than 35 tasks and 5
processors. In addition, because a greedy-type algorithm is used for the initial graph
reduction from Gn to Gk, poor assignments may result when the problem-graph contains
relatively uniform communication costs (16:243). This is likely to be the case in large
structural VHDL circuit simulations. Nevertheless, the concept of a phased approach to
partitioning a problem-graph presented by this algorithm holds potential for a
polynomial-time general-purpose partitioning algorithm that will provide near-optimal solutions.

2.4.8 Depth-First Breadth-Next Algorithm. An algorithm called Depth-First
Breadth-Next (DFBN) has been proposed to partition problem dependency graphs (17).
The goals of this algorithm are to assign dependent tasks to the same processor, and
independent tasks to different processors (17:63). The name is derived from the manner
in which the problem-graph is traversed when partitioning dependent tasks.

Figure 8. Example Problem Graph for DFBN Partitioning (17:64)

Two  assumptions  concerning  the  set  of  applicable  problem-graphs  are  made  for  the 
DFBN  algorithm.  First,  it  is  assumed  that  the  graphs  are  acyclic.  Second,  it  is  assumed 
that  the  tasks  execute  non-preemptively,  and  must  run  to  completion  once  they  begin 
execution  (17:65).  In  the  example  problem  graph  of  Figure  8,  each  edge  is  labeled  with  a 
total  communications  time,  and  each  task  is  labeled  with  a  total  execution  time.  These 
values correspond to the weights of the edges and tasks respectively, and are used in
prioritizing  non-critical  tasks  on  the  same  processor  for  scheduling  purposes  (17:69).  A 
critical  task  is  one  that  must  be  executed  before  any  others  in  order  for  any  progress  to  be 
made  (e.g.,  task  0  is  critical). 

The actual DFBN partition ignores the task and edge weights, and considers only the
interdependency relationships between the tasks. Since the edges in the graph represent
temporal dependencies and the assumption has been made that the tasks run
non-preemptively to completion, simultaneous execution of tasks located on the same path is
not allowed. Thus, each path in the graph represents a set of dependent tasks, and tasks
that are not on the same path are independent (17:68). The goal of assigning dependent
tasks to the same processor may not be completely achievable since two independent
tasks may have common dependencies (17:67). For example, in the graph of Figure 8,
tasks 1, 2, and 3 are all independent, but all share the common dependency on task 0.
Nevertheless, it may be possible to achieve a good approximation of the stated goal.

The partitioning algorithm simply performs a depth-first search of the problem-graph,
beginning at the source nodes. When a task has been discovered by the search, it is
marked so that it will not be included as part of more than one path. According to the
DFBN algorithm, each distinct path in the problem-graph resulting from the DFBN
search is assigned to a different processor (17:68). However, this assumes that the number
of distinct paths will be less than or equal to the number of available processors, which
may not be true. Furthermore, the DFBN algorithm makes no effort to balance the
computation load among the processors. Some progress is made towards limiting
inter-processor communications, since communications between dependent tasks of the same
path are on the same processor. However, this minimization is likely to be unevenly
applied and far from optimal.
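
The path extraction described above can be illustrated with a short depth-first
traversal that marks each task on discovery. This is a sketch of the idea as summarized
here rather than the published DFBN procedure: the function name and the example graph
are hypothetical, and the example graph is not the one shown in Figure 8.

```python
def dfs_paths(successors, sources):
    """Split an acyclic task graph into vertex-disjoint paths by depth-first search.

    successors: dict {task: [tasks that depend on it]}
    sources:    tasks with no predecessors, visited in the given order
    Each task is marked when first discovered, so it joins exactly one path.
    """
    marked, paths = set(), []

    def extend(task, path):
        marked.add(task)
        path.append(task)
        first = True
        for nxt in successors.get(task, []):
            if nxt in marked:
                continue
            if first:
                extend(nxt, path)      # first unmarked successor continues this path
                first = False
            else:
                branch = []            # any further successor starts a new path
                paths.append(branch)
                extend(nxt, branch)

    for src in sources:
        if src not in marked:
            root = []
            paths.append(root)
            extend(src, root)
    return paths


# Hypothetical dependency graph: task 0 feeds tasks 1, 2 and 3.
graph = {0: [1, 2, 3], 1: [4], 2: [4], 3: [5], 4: [6], 5: [6]}
print(dfs_paths(graph, [0]))  # [[0, 1, 4, 6], [2], [3, 5]]
```

The example output shows both weaknesses noted above: the three paths contain 4, 1, and
2 tasks, so the load is uneven, and the number of paths is fixed by the graph rather
than by the number of available processors.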

2.4.9 Kernighan-Lin Algorithm. A graph-partitioning procedure proposed in
1970, known as the Kernighan-Lin algorithm after its authors (14), is often used by
others as a standard for comparison with their own partitioning algorithms (9:78). The
Kernighan-Lin algorithm divides a weighted problem-graph into partitions of equal cost
and with a minimum cutset. This is accomplished by first making an arbitrary, but even,
partition of the problem-graph. Tasks are then swapped between partitions until the
cutset is locally minimum (14:295). Depending on
the nature and size of the problem-graph and the initial partition, the resulting cutset may
also be globally minimum, resulting in an “optimal” partition (14:295, 9:78).
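
The swap-until-locally-minimum idea can be sketched as a greedy pairwise exchange. It
should be stressed that this is a simplification and not the Kernighan-Lin procedure
itself, which selects sequences of swaps using per-vertex gains and may pass through
temporarily worse partitions. The function names and the example graph are
hypothetical.

```python
def cut_weight(weight, side):
    """Total weight of the edges whose endpoints lie on different sides."""
    return sum(w for (u, v), w in weight.items() if side[u] != side[v])


def swap_refine(weight, side):
    """Greedy pairwise-swap refinement of an even two-way partition.

    weight: dict {(u, v): edge weight}
    side:   dict {vertex: 0 or 1} giving the initial partition
    Repeatedly applies the single swap that most reduces the cutset and stops
    when no swap helps, i.e. at a local minimum. Returns a new dict.
    """
    side = dict(side)
    while True:
        current = cut_weight(weight, side)
        best_gain, best_pair = 0, None
        left = [v for v in side if side[v] == 0]
        right = [v for v in side if side[v] == 1]
        for a in left:
            for b in right:
                side[a], side[b] = 1, 0              # try the swap
                gain = current - cut_weight(weight, side)
                side[a], side[b] = 0, 1              # undo it
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            return side
        a, b = best_pair
        side[a], side[b] = 1, 0                      # commit the best swap


# Hypothetical graph: two triangles {0,1,2} and {3,4,5} joined by edge (2, 3).
edges = {(0, 1): 1, (0, 2): 1, (1, 2): 1, (3, 4): 1, (3, 5): 1, (4, 5): 1, (2, 3): 1}
start = {0: 0, 1: 0, 3: 0, 2: 1, 4: 1, 5: 1}
result = swap_refine(edges, start)
print(cut_weight(edges, start), cut_weight(edges, result))  # 5 1
```

Because every swap exchanges one task from each side, the partition stays even
throughout, which mirrors the equal-cost requirement when all tasks have unit weight.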

2.4.10 Simulated Annealing. Simulated Annealing (SA) refers to an algorithm that
is used to model groups of atoms being cooled to a ground state, using the concept of a
decreasing temperature to aid in the convergence of a solution (9:79). When applied to
the mapping problem, Simulated Annealing begins with a random initial partition of the
problem-graph, and involves moving tasks to other partitions if the total cost of the
partition is lowered. The partition cost equation has factors for the inter-partition
communication costs and the relative cost of the resulting load imbalance among the
partitions. A random number generator is used to select both the process to move and the
destination partition. If the move does not result in a lower total cost for the partition, the
move may still be made based on a decreasing random probability function (9:78).

The results of Conrad and Agrawal have shown that Simulated Annealing solutions to
the mapping problem approach optimal solutions more often than the Kernighan-Lin
algorithm, but require approximately two orders of magnitude more time to reach a
solution (9:78). Another shortfall of this approach is its non-determinism, which is due to
the use of a random number generator to select the proposed moves and to control the
probability function, as well as the randomness of the initial partition. Depending on the
initial partition and the sequence of proposed moves selected by the random number
generator, it is possible to get different results for the same input.
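
A compact sketch of this procedure is given below. Several details are assumptions made
for illustration and are not taken from the cited work: the acceptance rule is the
common Metropolis form exp(-delta/T), the load-imbalance term is the sum of squared
deviations from the mean load, the cooling schedule is geometric, and all names and
default parameter values are hypothetical.

```python
import math
import random


def partition_cost(weight, load, part, n_parts, alpha):
    """Communication cost of cut edges plus alpha times a load-imbalance penalty."""
    comm = sum(w for (u, v), w in weight.items() if part[u] != part[v])
    loads = [0.0] * n_parts
    for task, p in part.items():
        loads[p] += load[task]
    mean = sum(loads) / n_parts
    return comm + alpha * sum((x - mean) ** 2 for x in loads)


def anneal(weight, load, n_parts, alpha=1.0, t_start=10.0, t_end=0.01,
           cooling=0.95, moves_per_temp=200, seed=0):
    """Return (best partition found, its cost) for the given problem-graph."""
    rng = random.Random(seed)
    tasks = sorted(load)
    part = {t: rng.randrange(n_parts) for t in tasks}   # random initial partition
    current = partition_cost(weight, load, part, n_parts, alpha)
    best, best_cost = dict(part), current
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            # Randomly pick the task to move and its destination partition.
            task, dest = rng.choice(tasks), rng.randrange(n_parts)
            old = part[task]
            if dest == old:
                continue
            part[task] = dest
            new = partition_cost(weight, load, part, n_parts, alpha)
            # Accept improvements always, and worse moves with a probability
            # that shrinks as the temperature decreases.
            if new <= current or rng.random() < math.exp((current - new) / temp):
                current = new
                if current < best_cost:
                    best, best_cost = dict(part), current
            else:
                part[task] = old                        # reject: undo the move
        temp *= cooling
    return best, best_cost
```

Calling `anneal` twice with different `seed` values may return different partitions for
the same input, which is the non-determinism discussed above. Fixing the seed makes a
run repeatable but does not remove the dependence of the result on that choice.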

2.4.11 Mean Field Annealing. An algorithm related to Simulated Annealing,
called Mean Field Annealing (MFA), also models groups of
atoms being cooled to a ground state (9:79). As in Simulated Annealing, MFA begins
with a random initial partition and uses a random number generator at each iteration of
the algorithm to select a task for consideration. The state of the selected task is updated to
determine the proper partition assignment. To control the moves and force a convergence
to a solution, the algorithm uses a probability function that takes into account the total
cost of the partition and decreases according to some specified schedule (5). The main
advantage of this algorithm is that it converges to a solution much faster than Simulated
Annealing, requiring an execution time on the same order as the Kernighan-Lin
algorithm (9, 5).

Derivation of the MFA algorithm is accomplished by analogy to the Ising spin model,
which estimates the state of a system of particles, or spins, in thermal
equilibrium by computing an energy function (5:295). When applied to the mapping
problem, the energy function is interpreted as the cost of a given partition. This cost
can be computed by an objective function H containing a factor for the
inter-partition communications H_c, and a factor for the partition load imbalance H_b (9:78):

\[ H = H_c + a H_b \]

where the parameter a is a scaling factor used to maintain a balance between the
objectives of minimizing communication costs and maintaining a balanced computation
load (5:297, 9:78). The factor a can be taken as an input parameter to control the level of
load imbalance that is acceptable in a partition.

A function spin(i, p) is defined, with output s_ip, which returns the probability of
mapping task i to processor p. By definition, s_ip is a continuous variable in the range
0 ≤ s_ip ≤ 1. However, as the algorithm reaches a solution, the spin values s_ip converge to
either 0 or 1, with a 1 indicating that task i is mapped to processor p (5:296). Thus, the
solution can be represented by an N x P spin matrix in which each row contains a single 1
and P-1 zero entries. An example for N = 8 and P = 4 is shown in Figure 9.

                 P processors
                1   2   3   4
            1   0   0   0   1
            2   0   0   0   1
    N       3   0   1   0   0
processes   4   1   0   0   0
            5   0   0   1   0
            6   0   1   0   0
            7   1   0   0   0
            8   0   0   1   0

Figure 9. Example Spin Matrix for N = 8 and P = 4 (5:296)

The cost function H is defined by (5:296):

\[
H = H_c + a H_b
  = \frac{1}{2} \sum_{i=1}^{N} \sum_{j \ne i} \sum_{p=1}^{P} \sum_{q \ne p}
    e_{ij}\, s_{ip}\, s_{jq}\, d_{pq}
  + \frac{a}{2} \sum_{i=1}^{N} \sum_{j \ne i} \sum_{p=1}^{P}
    s_{ip}\, s_{jp}\, w_i\, w_j
\]

where e_ij represents the communications cost between tasks i and j, w_i represents the
computation cost of task i, and d_pq represents the relative communications cost per
message between processors p and q. In order to calculate the function spin(i, p), the
mean field function φ_ip is defined as (5:296):

\[
\phi_{ip} = -\sum_{j \ne i} \sum_{q \ne p} e_{ij}\, s_{jq}\, d_{pq}
            - a \sum_{j \ne i} s_{jp}\, w_i\, w_j
\]

Each individual spin average s_ip is proportional to e^(φ_ip/T), where T is the temperature
of the system, and s_ip is normalized as (5:296):

\[
s_{ip} = \frac{e^{\phi_{ip}/T}}{\sum_{q=1}^{P} e^{\phi_{iq}/T}}
\]

This normalization forces each row of the spin matrix to sum to 1, and ensures that each
task i is mapped to only one processor when the system stabilizes for a given temperature
T (5:296). During each iteration of the MFA algorithm, a randomly selected row i in the
spin matrix is recalculated using the equations for φ_ip and s_ip. After each iteration, the
cost function H is recalculated in order to detect a convergence to an equilibrium state for
a given temperature T. If the cost function H does not decrease after a specified number
of iterations, the system is said to be stabilized for the current temperature. The
temperature T is then decreased according to a specified schedule (5:297). As T is
decreased further, the probability that the randomly selected task i will be moved from its
currently assigned processor decreases, and a final solution is eventually reached.
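
The recalculation of one row of the spin matrix can be sketched as follows. The sketch
assumes that the mean field for task i and processor p has the pairwise form
-Σ_{j≠i} Σ_{q≠p} e_ij s_jq d_pq - a Σ_{j≠i} s_jp w_i w_j, and that the row is then
normalized with an exponential (softmax) in φ/T. The function names, the list-of-lists
data layout, and the two-task example are hypothetical.

```python
import math


def mean_field(i, p, spins, e, w, d, a):
    """Mean field for task i on processor p under the assumed pairwise cost.

    spins[j][q] is the current spin of task j on processor q, e[i][j] the
    communication cost between tasks, d[p][q] the relative cost per message
    between processors, w[i] the computation cost, and a the scaling factor.
    """
    n, n_procs = len(spins), len(spins[0])
    comm = sum(e[i][j] * spins[j][q] * d[p][q]
               for j in range(n) if j != i
               for q in range(n_procs) if q != p)
    balance = sum(spins[j][p] * w[i] * w[j] for j in range(n) if j != i)
    return -comm - a * balance


def update_row(i, spins, e, w, d, a, temp):
    """Recompute row i of the spin matrix in place so that it sums to 1."""
    n_procs = len(spins[0])
    phi = [mean_field(i, p, spins, e, w, d, a) for p in range(n_procs)]
    top = max(phi)                      # subtract the maximum for numerical stability
    expo = [math.exp((x - top) / temp) for x in phi]
    total = sum(expo)
    spins[i] = [x / total for x in expo]
    return spins[i]


# Hypothetical example: task 1 is fixed on processor 1 and task 0 is undecided.
e = [[0, 3], [3, 0]]
d = [[0, 1], [1, 0]]
w = [1, 1]
spins = [[0.5, 0.5], [0.0, 1.0]]
row = update_row(0, spins, e, w, d, a=1.0, temp=1.0)
print([round(x, 3) for x in row])  # [0.119, 0.881]
```

In this example the communication cost of 3 outweighs the imbalance penalty of 1, so
the updated row favors placing task 0 on the same processor as task 1. Lowering `temp`
sharpens the row toward a 0/1 assignment.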

The results of Bultan and Aykanat using the MFA algorithm to solve the mapping
problem have shown that its solutions approach the quality of those from the Simulated
Annealing algorithm in 1/20th the time (5:301). However, like Simulated Annealing, the
MFA algorithm has the undesirable property that it is non-deterministic due to the
random initial partition and the random selection of processes to update from the spin
matrix. Nevertheless, it appears to be a viable alternative for solving the mapping problem
for general parallel problems.

2.5 Summary

The methods for solving the general mapping problem outlined in this
chapter represent only a small cross-section of those possible. None of the methods
presented claims to provide an optimal solution for all problems. However, most of the
methods provide good solutions for a specific subset of problems.

The  next  chapter  presents  a  partitioning  algorithm  that  draws  upon  properties  from 
several  of  the  algorithms  presented  in  this  chapter  in  an  attempt  to  find  a  scheme  that 
provides  consistently  good  partitions  for  a  wide  variety  of  structural  VHDL  circuit 
simulations. 

III. Problem Analysis


3.1 Overview

This chapter presents a discussion of the graph-partitioning algorithm proposed in this
thesis for mapping VHDL circuit simulations to multiple logical processes for execution
in parallel. In order to understand many of the decisions made in formulating the
partitioning algorithm, it is necessary to first understand the target application. Therefore,
the chapter begins with a review of AFIT’s parallel VHDL
simulator, VSIM, as implemented by Comeau and Breeden.


3.2 Implementation of VSIM

Previous AFIT research has resulted in the successful implementation of a parallel
simulator for simulating structural VHDL circuits on an Intel hypercube (8, 4). This
simulator, known as VSIM, uses the intermediate C source code created by the sequential
Intermetrics commercial simulator as shown in Figure 10 (4:21). To simulate a structural
VHDL circuit using VSIM, the Intermetrics sequential simulator compiles the VHDL
source code and creates an IVAN (Intermediate VHDL Attributed Notation) file
containing intermediate-form code descriptions of the circuit components. Next, during
the model generate phase, the Intermetrics simulator transforms the IVAN file into C
source code files and creates the corresponding object files (8). Using a tool called
pbuild, these C source code files (and their associated header files) are intercepted and
transformed into VSIM-compatible files that can be executed in parallel (4:20).

VSIM is implemented to run over the SPECTRUM parallel simulation testbed, which
manages the inter-process synchronization (4, 20). VSIM itself is independent of the
parallel discrete event simulation (PDES) synchronization protocol being used. To
maintain continuity with prior AFIT research, this research uses the conservative
Chandy-Misra synchronization algorithm with null messages¹ to provide deadlock avoidance (7).

The structural descriptions and simple processes representing the VHDL subset
implemented by VSIM form “behavioral instances” which are partitioned into groups to
form Logical Processes (LPs) (4:21). The LPs are in turn partitioned among the available
processor nodes for parallel execution. Each LP contains a copy of the necessary
application code, the complete SPECTRUM code, and the machine-dependent operating
system interface code (12). As shown in Figure 11, the SPECTRUM testbed provides an
interface between the application code and the machine-dependent interface to the
operating system. As such, the SPECTRUM code provides the message sending and
receiving functionality by which two LPs communicate with each other. Variants of the
simulation protocol used to maintain synchronization are implemented via filters as
shown in Figure 11 (4, 12). In the case of VSIM, the Chandy-Misra null-message
protocol is implemented via a filter called vhdlclocks.

Figure 11. SPECTRUM Interface for a Single LP (4, 12)

¹ Null messages contain no signal change information. They exist for the sole purpose of advancing the
simulation clock, and do not correspond to actual events in the physical system (7).

In addition to running in parallel on an Intel hypercube, VSIM also has a sequential
mode which can be executed on a Sun Sparcstation. In fact, VSIM’s parallel simulation
cycle is a direct extension of its sequential simulation cycle. Furthermore, prior to
executing VSIM on the hypercube, the circuit must be executed in the sequential mode
for at least one entire simulation cycle in order to extract the behavior numbers and
inter-dependency relationships. Therefore, this section begins with a discussion of the
sequential  mode  of  VSIM.  This  is  followed  by  an  introduction  to  the  SPECTRUM 
parallel  simulation  testbed,  which  leads  into  a  discussion  of  how  the  sequential  simulation 
cycle  is  extended  in  VSIM’s  parallel  mode. 

3.2.1 Sequential Simulation.

3.2.1.1 Data Structures. The four fundamental data structures used in VSIM’s
sequential mode are directly derived from the commercial Intermetrics simulator and are
as follows (4:21-23):

• Behavioral Instance - Used to describe the input/output behavior of each
executable VHDL process. A separate behavioral instance exists for each VHDL
behavior in the simulation, although a single execution routine may be shared by
several instances (4:21).

• Signal Record Table - Used to maintain the current state of each signal in
the simulation, including pointers to behavioral instances in order to identify the
dependency connections for each signal (4:21-22).

• Behavior List - Used to maintain a list of all behavioral instances scheduled
for execution during the current simulation time. All behaviors are scheduled for
execution (i.e., placed in the behavior list) at simulation startup (t = 0) in order to
initialize the values of all signals. When a behavior is executed, it is removed
from the behavior list. A change in a signal value will cause all dependent
behaviors to be re-scheduled for execution (4:22).

• Active Record List - Used as a next-event list for the simulator. An
“event” is defined as the output of a behavioral instance execution routine that
may result in a signal change. Entries in the list are maintained in increasing order
of the simulation time associated with each scheduled event (4:23).

Figure 12 shows an example of the interrelationship of these fundamental data structures.
In this example, signal id 2 has an entry on the active record list because it is changing
values from a ‘0’ to a ‘1’ at simulation time 50. The active record list entry contains the
new value and a pointer to the specific signal record. The signal record contains pointers
to the signal’s current value in main memory and the affected behavior instances -
behaviors 2 (AND gate) and 3 (XOR gate) in this example. These affected behaviors are
in turn added to the behavior list for execution at time 50 (4:23, 8:3-12). Note that the
entries in the behavior list contain pointers to the associated behavioral instances, which in
turn contain pointers to the appropriate behavioral execution routine.
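The relationship between these four structures can be sketched in a few lines of Python. This is only an illustration and not VSIM’s C implementation: the class names, field names, and the helper apply_active_record are hypothetical, and the event time of 50 is taken from the example above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BehavioralInstance:
    behavior_id: int
    routine: Callable[..., int]      # execution routine (may be shared by instances)
    delay: int = 1                   # logical output delay (illustrative value)


@dataclass
class SignalRecord:
    signal_id: int
    value: int                       # current value of the signal
    dependents: List[BehavioralInstance] = field(default_factory=list)


@dataclass(order=True)
class ActiveRecord:
    time: int                        # entries are ordered by event time only
    new_value: int = field(compare=False)
    signal: SignalRecord = field(compare=False)


def apply_active_record(rec: ActiveRecord, behavior_list: list) -> None:
    """Update the signal and schedule its dependent behaviors if the value changed."""
    if rec.new_value != rec.signal.value:
        rec.signal.value = rec.new_value
        for behavior in rec.signal.dependents:
            if behavior not in behavior_list:
                behavior_list.append(behavior)


# Signal 2 changes from 0 to 1 at time 50 and feeds behaviors 2 (AND) and 3 (XOR).
and_gate = BehavioralInstance(2, lambda a, b: a & b)
xor_gate = BehavioralInstance(3, lambda a, b: a ^ b)
sig2 = SignalRecord(2, 0, [and_gate, xor_gate])
behavior_list: list = []
apply_active_record(ActiveRecord(50, 1, sig2), behavior_list)
print([b.behavior_id for b in behavior_list])
```

After the call, both dependent behaviors are on the behavior list and the signal record holds the new value.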
Figure 12. Interrelationship of VHDL Simulation Data Structures (8:3-14)


3.2.1.2 Sequential Simulation Cycle. Figure 13 shows the VHDL simulation
cycle used in VSIM’s sequential mode. The simulation main loop consists of calls to a
series of four specialized routines represented by the circles in the figure. Specifically, the
primary simulation routines are (4:25, 8:3-15):

• Execute Behaviors - Executes behaviors on the behavior list and removes
them from the list. This is where the simulation loop begins.

• Post - For each behavior that is executed, this routine posts the corresponding
event to the active record list.

• Get Low Time - Extracts the entries from the active record list with the
lowest next-event time and updates the simulation clock to this lowest time value.

• Compare Values - For each entry that is removed from the active record
list, this routine compares the new data value in the active record list entry with
the current data value stored in memory for that signal. If there is no change in
data values, the event is ignored. If there is a change in data values, the affected
dependent behaviors are added to the behavior list and scheduled for execution
during the next simulation cycle.

As stated previously, all behaviors are added to the behavior list at simulation startup
and scheduled for execution at simulation time t = 0. In addition, the current
implementation of VSIM requires that all input signal changes be explicitly defined in the
VHDL source code. For example, complex processes which randomly generate a new set
of input signals during the course of the simulation are not supported by VSIM. As a
result of this, all input signal changes (i.e., those specified in the testbench) are added to
the active record list at simulation startup (4:26).

The  simulation  loop  shown  in  Figure  13  continues  until  both  the  active  record  list  and 
the  behavior  list  remain  empty  for  an  entire  simulation  cycle,  at  which  point  the 
simulation  is  complete  (4:26). 
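The four-step loop can be illustrated with a minimal Python sketch. It is not VSIM’s code: the function simulate and its data layout are hypothetical, every behavior is assumed to have a delay of at least one time unit, and a behavior is scheduled only once per cycle even if several of its inputs change.

```python
import heapq


def simulate(behaviors, signals, stimuli):
    """behaviors: {id: (func, input_names, output_name, delay)}
    signals:   {name: current value}
    stimuli:   [(time, signal_name, value)] taken from the testbench."""
    # Signal-record style fanout: which behaviors depend on each signal.
    fanout = {}
    for bid, (_, inputs, _, _) in behaviors.items():
        for name in inputs:
            fanout.setdefault(name, []).append(bid)

    active = list(stimuli)            # active record list, ordered by time
    heapq.heapify(active)
    behavior_list = list(behaviors)   # all behaviors are scheduled at t = 0
    now = 0

    while behavior_list or active:
        # Execute Behaviors and Post: each output becomes a future event.
        for bid in behavior_list:
            func, inputs, out, delay = behaviors[bid]
            value = func(*(signals[n] for n in inputs))
            heapq.heappush(active, (now + delay, out, value))
        behavior_list = []

        # Get Low Time: advance the clock to the lowest next-event time.
        now = active[0][0]

        # Compare Values: ignore events that do not change the signal.
        while active and active[0][0] == now:
            _, name, value = heapq.heappop(active)
            if signals[name] != value:
                signals[name] = value
                for bid in fanout.get(name, []):
                    if bid not in behavior_list:
                        behavior_list.append(bid)
    return now, signals


# One AND gate with a delay of 3; both inputs rise at time 5.
gates = {1: (lambda a, b: a & b, ("a", "b"), "y", 3)}
end_time, final = simulate(gates, {"a": 0, "b": 0, "y": 0},
                           [(5, "a", 1), (5, "b", 1)])
print(end_time, final)
```

In this run the gate first executes at t = 0 (producing an ignored event), executes again at t = 5, and its output changes at t = 8, after which both lists are empty and the loop ends.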

3.2.1.3 Handling Behavior Delays. Each behavior execution routine has a
nonnegative logical delay, t_delay, associated with it. When a behavior has a change on
one of its input lines at time t, it is scheduled for execution at time t. When a behavior is
removed from the behavior list and executed at time t, the resulting output signal is
posted to the active record list with a time-stamp of t + t_delay (4:27).

Figure 13. The Sequential VHDL Simulation Cycle (8:3-15)

If  a  behavior  has  n  input  signals  that  change  value  at  time  t,  the  behavior  will 
execute  n  times  and  the  resulting  output  signal  will  be  posted  to  the  active  record  list  n 
times.  In  this  situation,  the  correct  event  will  always  be  the  one  corresponding  to  the  most 
recent  behavior  execution.  As  a  result,  whenever  a  new  event  has  the  same  behavior  id 
and  time-stamp  as  another  event  on  the  active  record  list,  the  new  event  replaces  the  old 
event  (4:27). 

There are two types of delay in VHDL - transport delay and inertial delay. With
transport delay, the output is a function of the input signals regardless of how long the
input signals hold their values. With inertial delays, however, the output function requires
that the input signals maintain their values for a time equal to t_delay before the output
signal may change. This may result in an entry being removed from the active record list
if an input signal is not held constant for the required time (4:27). For example, consider
the AND gate in Figure 14. At time t = 3 ns, both input signals are ‘1’, and an event for
signal Out to go from ‘0’ to ‘1’ at time t = 6 ns is added to the active record list based
upon the output function of Out = In1 AND In2 after t_delay. However, at time t = 5 ns,
input signal In2 returns to ‘0’ before signal Out has changed values. As a result, the event
for signal Out at time t = 6 ns is removed from the active record list (4:28).
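The two list-maintenance rules described in this section (replacing an event that has the same behavior id and time-stamp, and withdrawing a pending event under inertial delay) can be sketched as follows. The function names and the tuple layout (time, behavior id, value) are hypothetical, and re-posting the new output after the cancellation is an assumption about how the removal would be carried out, not a description of VSIM’s code.

```python
def post_event(active, behavior_id, time, value):
    """Post an event, replacing any event with the same behavior id and time-stamp."""
    active[:] = [e for e in active
                 if not (e[0] == time and e[1] == behavior_id)]
    active.append((time, behavior_id, value))
    active.sort()                      # keep the list in increasing time order


def post_inertial(active, behavior_id, now, t_delay, value):
    """Inertial delay: drop this behavior's pending outputs, then post the new one."""
    active[:] = [e for e in active
                 if not (e[1] == behavior_id and e[0] > now)]
    post_event(active, behavior_id, now + t_delay, value)


AND = 2                                # behavior id of the AND gate, t_delay = 3 ns
events = []
post_inertial(events, AND, now=3, t_delay=3, value=1)   # both inputs '1' at 3 ns
pending_at_3 = list(events)
post_inertial(events, AND, now=5, t_delay=3, value=0)   # In2 falls at 5 ns
print(pending_at_3, events)

# Replacement rule: a second execution at the same time overrides the first.
dupes = []
post_event(dupes, 7, 10, 0)
post_event(dupes, 7, 10, 1)
print(dupes)
```

With the AND gate example, the event scheduled for 6 ns is gone after the input falls at 5 ns, so Out never leaves ‘0’.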

3.2.2 SPECTRUM Testbed. As shown in Figure 11, VSIM runs over the
SPECTRUM parallel simulation testbed. SPECTRUM allows the application to be
parallelized by dividing it into logical processes (LPs). The core of SPECTRUM, its “LP
manager” (file lp_man.c), manages communications between LPs. Function calls to
lp_man.c are intercepted by a “filter” which provides the specific PDES synchronization
protocol being used (4:30). In the case of VSIM, the filter vhdlclocks.c implements the
conservative Chandy-Misra null message PDES synchronization protocol. Below
SPECTRUM’s LP manager, the package cube2.c provides the interface to the hypercube
operating system (4:31). As indicated by Figure 11, each LP contains its own copy of
both the application code and the SPECTRUM code. This includes the LP manager, the
protocol filter, and the operating system interface.

Figure 14. AND Gate with Inertial Delay (4:29)

Figure 15. LP Message Receipt (4:35)

As shown in Figure 15, SPECTRUM maintains its own queue for incoming messages
destined for a given LP. Messages are stored in the SPECTRUM queue until requested by
the VSIM application. When VSIM checks for incoming events, messages on the
SPECTRUM queue are removed by the “receive filter.” The receive filter eliminates
synchronization control messages (e.g., null messages) and places only valid events in the
LP’s active record list (4:35). The “safe time” is used to control which events the LP may
safely process and will be discussed further in the next section.


The primary functions provided by SPECTRUM’s LP manager are (4:30):

• lp_init() - Initializes the LP and builds the appropriate filter tables.

• lp_get_event() - Retrieves top event from SPECTRUM message queue.

• lp_post_event() - Sends event to the owning LP of the affected behavior.

• lp_advance_time() - Advances the value of the local simulation clock.

• lp_terminate() - Terminates the simulation.


3.2.3 Parallel VSIM Implementation.

3.2.3.1 Parallel Simulation Cycle. When the VHDL behaviors are partitioned
among multiple LPs, each LP assumes “ownership” of those behaviors in its partition. In
VSIM’s parallel mode, each LP executes the sequential simulation cycle of Figure 13 for
the behaviors it owns. However, because a signal change on one LP may affect behaviors
owned by another LP, the simulation cycle must be modified to allow an LP to forward
signal changes to other LPs.

Figure  16  shows  the  simulation  cycle  modified  for  parallel  simulation.  As  before,  all 
behavior  outputs  are  posted  to  the  local  active  record  list.  When  a  signal  record  is 
removed  from  the  active  record  list  and  the  value  comparison  determines  that  a  signal 
change  has  occurred,  only  the  affected  behaviors  owned  by  the  local  LP  are  posted  to  the 
behavior  list.  If  the  signal  change  affects  behaviors  owned  by  another  LP,  an  event  is 
created  and  sent  to  the  owning  LP  where  it  is  eventually  placed  in  that  LP’s  active  record 
list.  In  this  manner,  only  the  results  of  behavior  executions  that  result  in  actual  signal 
changes  are  forwarded  to  the  affected  LPs  as  event  messages.  Figure  17  shows  the 
parallel  simulation  cycle  for  two  LPs. 
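The routing decision made after a value comparison can be sketched as below. This is a hypothetical illustration: the function route_signal_change, the dictionaries, and the choice to create one event per remote owning LP (rather than one per remote behavior) are assumptions, not details confirmed by the source.

```python
def route_signal_change(signal, new_value, time, dependents, owner, my_lp,
                        behavior_list, outbox):
    """Schedule local dependents; create event messages for remote owners."""
    for bid in dependents[signal]:
        if owner[bid] == my_lp:
            if bid not in behavior_list:
                behavior_list.append(bid)          # local: goes on the behavior list
        else:
            # remote: the owning LP will place this event on its active record list
            outbox.setdefault(owner[bid], set()).add((time, signal, new_value))


dependents = {"s": [1, 2, 3, 4]}       # behaviors that read signal s
owner = {1: 0, 2: 0, 3: 1, 4: 1}       # behavior id -> owning LP
local_behaviors, outbox = [], {}
route_signal_change("s", 1, 50, dependents, owner, 0, local_behaviors, outbox)
print(local_behaviors, outbox)
```

Only the behaviors owned by LP 0 are scheduled locally; the change reaches LP 1 as an event message.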
3.2.3.2 Synchronization Protocol. VSIM has been designed to execute in a
multi-processor, message-passing system with no shared memory. Running over the
SPECTRUM testbed, each LP executes an identical copy of the simulation on a disjoint
subset of the data (i.e., behaviors). Such a situation is referred to as a single
program/multiple data (SPMD) configuration (4:33). In this environment, the parallel
simulation cycle of Figure 17 presents an inter-LP synchronization problem caused by
dependencies between behaviors owned by different LPs. As an example, consider the
hypothetical 2 LP partition for an edge-triggered D flip-flop shown in Figure 18. Each
LP maintains a separate simulation clock which holds that LP’s local virtual time
(LVT) and indicates how far the LP has progressed in the simulation. In the
case of Figure 18, LP1 only has four behaviors to execute as compared to six for LP0.
Thus, it is reasonable to expect that LP1 may proceed at a faster rate than LP0. Let t_0 and
t_1 be the LVT of LP0 and LP1 respectively, such that t_0 < t_1. In this situation, it is
possible for LP0 to send LP1 an event message with a time that is in LP1’s past.

Figure 17. The Parallel VHDL Simulation Cycle for Two LPs (4:34)

As discussed by Breeden, there are two basic approaches to handling this
synchronization problem. In the optimistic approach, LP1 would be rolled back in virtual
time to a previously saved state at a time equal or prior to the time of the incoming
message from LP0. The protocol gets its name from the fact that LP1 would be allowed to
proceed as fast as possible with the optimistic assumption that it will not get any late
messages from LP0. In the conservative approach, LP1 is not allowed to advance its
simulation clock to time t_1 unless it is guaranteed that it will not receive any messages
with a time-stamp less than t_1 (4).

Breeden  discusses  the  mechanics  of  the  conservative  Chandy-Misra  null-message 
synchronization  protocol  used  in  this  thesis  research  (4:11-13,  31-33).  Since  the  null 
messages  add  directly  to  the  communications  overhead  of  the  partition,  the  rules  for  their 
transmission  are  reviewed  in  the  next  section. 
Figure 18. Hypothetical 2 LP Partition for Edge-Triggered D Flip-Flop


3.2.3.3 Null Messages. In order to discuss the rules by which null messages
are transmitted, the following variables must be defined (4:32):

• t_null - time-stamp of the null message.

• t_neq - lowest time-stamp of all events in an LP’s active record list.

• t_delay - logical output delay of an LP for a given output line.

• t_safe - local virtual time (LVT) that an LP may “safely” approach.

To maintain synchronization, each directed communication link, or line^, between
LPs has a clock associated with it which tracks the time-stamp of the most recent message
transmitted over that line. By definition, once a message is transmitted over a line with
time-stamp t, it is impossible for a message to be transmitted over that same line with a
time-stamp less than t. The safe time, t_safe, is the minimum time of all input lines, and
represents the maximum time which the LP may approach with the guarantee that no
messages will be received with earlier time-stamps. As shown in Figure 15, the safe
time is updated each time an LP receives a message from another LP. This is
accomplished by comparing the updated time-stamps of each input line to find the
minimum. Null messages are used to advance the times associated with each line (thus
advancing t_safe) and otherwise contain no useful information (4:32).

^ The term line is used to refer to the directed arcs in the LP connectivity graph. This is not to be confused
with the arcs in the problem graph which represent inter-behavior dependencies. Given N LPs, each LP can
have at most (N-1) output lines, but each output line may consist of multiple inter-behavior arcs.
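The safe-time bookkeeping can be illustrated with a short Python sketch. The class InputLines and its methods are hypothetical names; the only behavior taken from the text is that each input line keeps the time-stamp of its latest message and that the safe time is the minimum over all input lines.

```python
class InputLines:
    """Per-line clocks for one LP; the safe time is the minimum line clock."""

    def __init__(self, senders):
        self.clock = {lp: 0 for lp in senders}   # every line starts at time 0

    def receive(self, sender, timestamp):
        """Record an event or null message and return the updated safe time."""
        if timestamp < self.clock[sender]:
            raise ValueError("time-stamps on a line must be non-decreasing")
        self.clock[sender] = timestamp
        return self.safe_time()

    def safe_time(self):
        return min(self.clock.values())


lines = InputLines(senders=[1, 2])
s1 = lines.receive(1, 10)   # line from LP 2 is still at 0, so safe time stays 0
s2 = lines.receive(2, 4)    # both lines have advanced; safe time becomes 4
s3 = lines.receive(2, 12)   # the line from LP 1 (at 10) is now the minimum
print(s1, s2, s3)
```

The example shows why a silent neighbor holds an LP back: the safe time cannot pass the clock of the slowest input line, which is what null messages are meant to advance.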

As implemented in the filter vhdlclocks.c, null messages are sent under the
following circumstances (4:32-33):

• By definition, an LP is constrained to process events in non-decreasing order of
simulation time-stamps. As such, when an LP processes an event at time t, it is
guaranteed that it will not process an event in the future with a time-stamp of less
than t. Therefore, when LPn sends an event message over one of its output lines
with a time-stamp of t, it sends a null message over all other output lines with a
time-stamp of t to let all other LPs know that they will not receive an event
message from LPn with a time-stamp less than t. This allows them to advance the
clock associated with the input line from LPn to time t.

• When there are no event messages waiting in the SPECTRUM input queue at the
time of a request by VSIM, the receive filter (see Figure 15) checks to see if there
is an event on the local active record list that has a time-stamp less than or equal
to the safe time (i.e., t_neq ≤ t_safe). If so, a null pointer is returned to VSIM and
processing may continue. If not, the filter blocks, waiting for an incoming
message to advance the safe time. Prior to blocking, however, the LP sends a null
message over each of its output lines with a time-stamp equal to the smaller of the
safe time plus the LP delay for that output line, or the time at the top of the local
active record list (i.e., t_null = min(t_safe + t_delay, t_neq)). These null messages
serve as guarantees to the receiving LPs that no events prior to t_null will be
received from the sending LP. This allows the receiving LPs to update their safe
times, thus avoiding cyclical waiting and preventing deadlock.
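The two transmission rules above can be written out as a small Python sketch. The function names are hypothetical, and the handling of an empty active record list (t_neq passed as None) is an assumption that the source does not address.

```python
def nulls_on_event_send(output_lines, event_line, t):
    """Rule 1: an event sent at time t is accompanied by null messages,
    also stamped t, on every other output line."""
    return {line: t for line in output_lines if line != event_line}


def receive_filter_step(t_safe, t_neq, t_delay):
    """Rule 2: decide whether the LP may proceed or must block.

    t_delay maps each output line to its logical delay.
    Returns (may_proceed, {line: t_null}) where the dict holds the null
    messages to send before blocking."""
    if t_neq is not None and t_neq <= t_safe:
        return True, {}                          # a safe local event exists
    nulls = {}
    for line, delay in t_delay.items():
        bound = t_safe + delay
        # t_null = min(t_safe + t_delay, t_neq)
        nulls[line] = bound if t_neq is None else min(bound, t_neq)
    return False, nulls


r1 = nulls_on_event_send([1, 2, 3], event_line=2, t=40)
r2 = receive_filter_step(t_safe=10, t_neq=8, t_delay={1: 2})
r3 = receive_filter_step(t_safe=10, t_neq=15, t_delay={1: 2, 2: 7})
print(r1, r2, r3)
```

In the third call the LP must block: line 1 receives a null message stamped 12 (safe time plus delay), while line 2 receives one stamped 15 because the next local event is earlier than 10 + 7.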

In Breeden’s version of this protocol, null messages were also sent over each LP
output line at simulation startup with a time-stamp of that output line’s delay value. The
purpose of these null messages was to advance the safe time of each LP beyond zero so
that each LP may begin processing events. These null messages have been eliminated
since all behaviors in the system are automatically scheduled for execution at simulation
startup.

3.2.4 Code Transformation. Circuit specific information is contained in the
intermediate C code created during the model generate phase of compilation as shown in
Figure 10. This circuit specific code includes routines to instantiate each behavior and
signal and to execute the behavioral output functions. Breeden defines a seven step
process by which this intermediate C code is transformed into code compatible with
VSIM (4:29-30).

3.3  Partitioning  Requirements 

3.3.1  Load  Balancing.  Load  balancing  is  defined  as  the  degree  to  which  all 
available  processors  are  assigned  an  equal  share  of  the  computation  load  (26:59).  In 
parallel  VHDL  simulations,  the  computation  load  consists  of  processing  signal  changes, 
scheduling  the  affected  behaviors,  and  executing  the  affected  behaviors  (which  in  turn 
may  cause  further  signal  changes).  One  method  of  measuring  load  imbalance  is  to 
calculate  the  difference  in  the  minimum  and  maximum  finishing  times  of  all  processors  as 
a  percentage  of  the  maximum  finishing  time  (26:59): 
$$\text{load imbalance} = \frac{T_{max} - T_{min}}{T_{max}} \times 100\%$$

where T_max and T_min are the maximum and minimum finishing times of all processors.

However, this measure of load imbalance is only valid if all processors are able to
perform their share of the workload completely independent of the others. In most
parallel simulations, there exist dependencies between the workloads that have been
assigned to different processors. For example, consider the example VHDL simulation of
Figure 19 in which an edge-triggered D flip-flop has been partitioned into three LPs. LP0
has ownership of eight behaviors, versus one each for LP1 and LP2. Intuitively, LP0 will
process a majority of the signal changes and behavior execution routines. Thus, LP0 has
been assigned a clear majority of the computation workload. However, because of
inter-behavior dependencies which cross partition boundaries, LP1 and LP2 cannot complete
their share of the computation load until they receive the last event message from LP0. As
a result, while LP0 is processing a large number of events, LP1 and LP2 spend time
blocking for inputs from LP0. All three LPs will finish at approximately the same time
(with LP0 finishing slightly ahead of the other two), even though LP0 had a much greater
share of the computation load.

For  parallel  VHDL  simulation,  an  alternative  method  of  measuring  load  imbalance  is 
required.  The  method  proposed  in  this  thesis  uses  the  following  definitions: 

b_i = behavior i.

w_i = relative computation cost of behavior i.

B_q = set of all behaviors assigned to LP q.

L_q = the total computation load of LP q.

L_max = the maximum computation load of all LPs.

L_avg = the average computation load of all LPs.

Figure 19. Load Imbalance Example

With these variables, the computation load for each LP can be calculated as:

$$L_q = \sum_{b_i \in B_q} w_i$$

And  the  load  imbalance  factor,  Hb,  can  be  calculated  as: 

The relative computation cost of each behavior, w_i, is actually composed of two
components: the relative computational complexity of behavior i, and the relative
frequency that behavior i is executed during the simulation. In VSIM, all behaviors are
simple VHDL processes.^ Intuitively, the major difference in the computational intensity
of these simple VHDL processes is in the number of input signals to be evaluated. If the
number of behavior inputs is kept relatively small (e.g., ≤ 4), these computational
differences will be negligible when compared to the computation requirements of more
complex VHDL processes (such as a bus resolution function with 32 inputs). As such, the
relative computational complexity of all behaviors representing simple VHDL processes
can be safely assumed to be equal.

^ Simple boolean operation (AND, OR, NOT, assignment, etc.) with a finite number of inputs.

The relative frequency with which a behavior is executed is heavily dependent upon
the activity^ of its input signals during the simulation (23:85). Experience has shown that
the signal activity of a typical VHDL circuit is not evenly distributed. However, prior to
simulation, no data is available regarding signal activities (23:85). Since no information
is available to support a specific behavior weighting, all behaviors are assumed to be
equally affected by the circuit signal activity.

Since behaviors cannot be differentiated in terms of computational complexity or
execution frequency, all behaviors are evenly weighted (w_i = 1), and all mapping
decisions must be made solely on the static inter-dependency structure of the VHDL
circuit (23:85). Therefore, the computation load L_q is simply calculated as the number of
behaviors assigned to LP q.
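As a small illustration of these definitions, the following Python sketch computes the per-LP loads for the three-LP partition described earlier (eight behaviors on LP0, one each on LP1 and LP2). The function computation_loads and the dictionary layout are hypothetical; with no weights supplied, every behavior counts as w_i = 1.

```python
from collections import Counter


def computation_loads(assignment, weights=None):
    """Return {LP q: L_q}, where L_q is the sum of w_i over behaviors on LP q.

    assignment maps behavior id -> LP id; weights maps behavior id -> w_i.
    If weights is None, all behaviors are evenly weighted (w_i = 1)."""
    loads = Counter()
    for behavior, lp in assignment.items():
        loads[lp] += 1 if weights is None else weights[behavior]
    return dict(loads)


# Eight behaviors on LP 0, one each on LP 1 and LP 2.
assignment = {b: 0 for b in range(8)}
assignment[8] = 1
assignment[9] = 2

loads = computation_loads(assignment)
L_max = max(loads.values())
L_avg = sum(loads.values()) / len(loads)
print(loads, L_max, L_avg)
```

Here L_max is 8 while L_avg is 10/3, which quantifies how far this partition is from an even distribution of behaviors.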

3.3.2 Minimizing Communications Costs. Inter-process communications can be
defined as the sending of information from a source process to a destination process with
the destination process requiring the sent information in order to progress. In VSIM, each
behavior is a simple VHDL process, and the sending of a signal change from the output
of one behavior to the input of another behavior is one form of inter-process
communications. In VSIM’s sequential mode, this form of inter-process communications
consists of inserting a record in the behavior list as shown in Figure 13. In VSIM’s
parallel mode, this form of communications is identical if the receiving behavior is
owned by the same LP as the sending behavior. However, if the sending and receiving
behaviors are owned by different LPs, the signal must be sent as an inter-LP event
message using the SPECTRUM layer as shown in Figure 17. The insertion of the record
into the behavior list occurs at the receiving LP after the event message has been removed
from the receiving LP’s SPECTRUM incoming message queue. Thus, the sending and
receiving of inter-LP event messages by SPECTRUM represents a communications
overhead not present in the sequential simulation.

^ The activity of a signal is defined as the number of events generated for that signal during the simulation
(23:85).

In parallel VSIM, communications between behaviors that are owned by the same LP
do not represent additional communications overhead over the sequential version, and are
said to have a cost of zero. Rather, the communications overhead that we are interested in
minimizing is the inter-LP event message traffic that occurs when communicating
behaviors are owned by different LPs. Throughout the remainder of this thesis, the term
inter-process communications will be used to describe these message-level
communications between the logical processes. With the previous assumption of one LP
per physical processor, inter-process communications are also equivalent to
inter-processor communications, and the two terms can be used interchangeably to refer to
the communications overhead of the parallel simulation.

3.3.2.1 Modeling Inter-Process Communications. The model used in this thesis to
represent inter-process communications makes use of a
Communications Weight Matrix, and the following definitions:

w_ij = the relative cost of communications between processor i and processor j based
upon the topological layout of the processors,
a_ij = the number of directed inter-behavior dependencies (i.e., arcs) between LPi
and LPj.

With these definitions, the total cost of directed communications from LPi to LPj, C_ij, can
be calculated as:

$$C_{ij} = w_{ij} \, a_{ij}$$

Ideally, each arc that makes up the factor a_ij would be multiplied by a factor that accounts
for the frequency of the communications over that arc. However, the frequency of
communications is dependent upon the signal activity, and, as stated in section 3.3.1,
there is no information available regarding signal activity prior to simulation (23:85).
Therefore, the communications costs must be estimated based upon the known structural
dependencies that cross the LP boundaries. In addition, the assumption that LP0 will be
mapped to processor 0, LP1 will be mapped to processor 1, etc., is implicit in the
equation for C_ij.

Letting n be the number of logical processes (LPs), the inter-process communications
can be represented by an n x n communications weight matrix as shown in Figure 20. The
main diagonal entries represent the cost of an LP’s communications with itself. As stated
previously, these communications do not involve the sending and receiving of event
messages, and thus do not contribute to the parallel simulation communications overhead.
Therefore all main diagonal entries are always 0.

Letting H_c represent the overall total inter-process communications costs, it can be
calculated as the sum of the entries in the communications weight matrix:

$$H_c = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} C_{ij}$$

With the exception of a constant factor of 1/2, this equation is identical to the equation
used by Bultan to calculate the communications cost sub-function H_c in the mean field
annealing algorithm discussed in section 2.4.11 (5:296).

Figure 20. Communications Weight Matrix for n LPs

Upon further examination, however, a potential problem arises because H_c does not
give an accurate picture of the total inter-process communications relative to the total
signal change activity of the circuit (as estimated by the number of inter-behavior arcs).
For example, if circuit A has a total of 1,000 inter-behavior arcs with 100 crossing LP
boundaries and circuit B has a total of 10,000 inter-behavior arcs with 500 crossing LP
boundaries, circuit B will have a higher H_c than circuit A even though it has a smaller
percentage of its arcs crossing the LP boundaries. To account for this, it is desirable to
calculate the total inter-process communications costs as a percentage of the total
communications costs in the system. Thus, the equation for H_c is modified as follows:

$$H_c = \frac{\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} C_{ij}}{num\_arcs}$$

where num_arcs is the total number of inter-behavior arcs in the system. Minimizing the
value of H_c is one of the primary goals of the partitioning algorithms implemented for
this thesis.
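The construction of the communications weight matrix and the normalized cost can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, each entry is taken to be the topology weight times the number of directed arcs from LP i to LP j, and the weight is assumed to be the hypercube hop count (Hamming distance between node numbers), which the source does not specify.

```python
def hamming(i, j):
    """Assumed topology weight: number of hypercube hops between nodes i and j."""
    return bin(i ^ j).count("1")


def comm_weight_matrix(arcs, owner, n, weight=hamming):
    """Build the n x n matrix of weighted directed arc counts between LPs.

    arcs is a list of directed inter-behavior arcs (source, destination);
    owner maps behavior id -> LP id (LP q is assumed to run on processor q)."""
    counts = [[0] * n for _ in range(n)]
    for src, dst in arcs:
        i, j = owner[src], owner[dst]
        if i != j:                       # arcs inside one LP cost nothing
            counts[i][j] += 1
    return [[weight(i, j) * counts[i][j] for j in range(n)] for i in range(n)]


def normalized_comm_cost(matrix, num_arcs):
    """Sum of all matrix entries divided by the total number of arcs."""
    return sum(map(sum, matrix)) / num_arcs


owner = {0: 0, 1: 0, 2: 1, 3: 2, 4: 3}
arcs = [(0, 1), (1, 0), (0, 2), (1, 3), (2, 4), (4, 0)]
C = comm_weight_matrix(arcs, owner, n=4)
H_c = normalized_comm_cost(C, len(arcs))
print(C, H_c)
```

In this example four of the six arcs cross LP boundaries, and the arc from LP 3 to LP 0 counts double because those nodes are two hops apart, giving a normalized cost of 5/6. With unit weights, the circuit A and circuit B example in the text works out to 100/1,000 = 0.10 and 500/10,000 = 0.05, so the normalized measure ranks circuit B as the cheaper partition.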


3.3.2.2 Distribution of Communications. In addition to the amount of
inter-process communications, the distribution of those communications may also add to the
overhead of the parallel simulation. For example, consider the four LP examples of
Figure 21 in which it is assumed that all LPs are assigned an equal share of the
computation workload. In Figure 21.a, LP2 and LP3 run independently of all other LPs,
while there is a relative communications cost factor of 100 from LP0 to LP1. As a result,
LP2 and LP3 will finish in minimal time while LP1 cannot finish ahead of LP0 because
of the inter-LP dependencies. In this case, the communication costs between LP0 and LP1
form a bottleneck which holds up overall simulation completion^10. In Figure 21.b, the
problem has been repartitioned in order to split up the communications bottleneck
between LP0 and LP1 at the expense of a higher total communications cost. Although the
overall communications costs are higher, the simulation should reach completion faster
because of the reduced bottleneck in the communications costs.

Figure 21. Communications Distribution Example

The results presented in Chapter 5 show that in some circumstances, it is possible to
partition a circuit such that the total amount of inter-process communications is reduced
by more than 33%, but the resulting simulation performance is worsened with strong
evidence that this is due to a bottleneck in the distribution of the remaining inter-process
communications. As a result, the calculations for the total communications cost overhead
in this thesis include an additional factor in order to account for the distribution of the
inter-LP dependencies in the circuit partition.

^10 Simulation completion is defined as the completion time of the slowest logical process.

The calculation of the communications distribution factor H_d uses the following
assumptions and definitions:

•  Message  setup  and  transmission  time  is  much  more  significant  than  the  time  it 
takes  to  receive  an  incoming  message.  'Diis  assumptitm  is  based  upcMi  an  analysis 
comparing  the  overhead  involved  in  sending  vs.  receiving  an  inter-processor 
message.  A  message  receive  action  involves  inserting  the  message  in  a  queue 
where  it  is  held  until  actually  needed  by  the  simulation.  On  the  other  hand,  a 
message  send  action  involves  the  message  setup  and  transfer  times,  as  well  as 
blocking  time  while  the  sending  processor  waits  for  a  free  conununications  link. 
Intuitively,  the  sending  processor  has  a  greater  level  of  overtiead.  Instrumentation 
of  the  simulation  is  required  in  order  to  fully  validate  this  assumption. 

•  An  LP’s  contribution  to  the  total  communications  cost  overhead  is  directly 
proportional  to  the  total  weighted  communications  costs  associated  with  all  arcs 
leaving  that  LP.  This  is  represented  by  the  sum  of  the  row  in  the  communications 
weight  matrix  conresponding  to  that  LP. 

D_avg = the average weighted communications costs associated with each LP.

D_max = the maximum weighted communications costs associated with an LP.

The communications distribution factor is then calculated as the maximum positive
deviation from the average as a percentage of the average:

$$H_d = \frac{D_{max} - D_{avg}}{D_{avg}}$$
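The definition above can be sketched as follows, taking the per-LP cost to be the row sum of the communications weight matrix as stated in the assumptions. The function name, the two example matrices, and the choice to return 0 when there is no inter-LP traffic at all are assumptions made for this illustration.

```python
def distribution_factor(c):
    """Maximum positive deviation of an LP's outgoing cost from the average, relative to the average."""
    row_sums = [sum(row) for row in c]     # weighted cost of all arcs leaving each LP
    d_avg = sum(row_sums) / len(row_sums)
    if d_avg == 0:                          # assumed convention: no traffic, no imbalance
        return 0.0
    return (max(row_sums) - d_avg) / d_avg


# Hypothetical matrices: all traffic leaving one LP vs. traffic spread evenly.
bottleneck = [[0, 100, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
spread = [[0, 30, 0, 0], [0, 0, 30, 0], [0, 0, 0, 30], [30, 0, 0, 0]]
print(distribution_factor(bottleneck), distribution_factor(spread))
```

The evenly spread case carries more total traffic (120 against 100) yet has no distribution penalty, which mirrors the trade-off described for the repartitioned example.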


3.3.2.3 Effect of Lookahead. Simulation lookahead is defined as the ability
to predict what will happen or not happen in the future with complete certainty. A process
at simulation time t with lookahead t_L will be able to accurately predict all events it will
generate up to simulation time (t + t_L) (11:9). In the case of VSIM, an LP’s lookahead
refers to the ability to predict a simulation time in the future up to which that LP can
guarantee that it will generate no signal changes destined for other LPs. The LP’s
lookahead is defined by min((t_safe + t_delay), t_next).

As discussed in section 3.2.3.3, the value t_delay is defined as the minimum output
delay associated with each of the LP’s output lines. These delay values are specified in a
“.arcs” file that is read in by VSIM at runtime. In previous research, a uniform LP delay
time was used that was equal to the smallest non-zero delay in the circuit associated with
a single behavior (4). In order to guarantee optimal LP delay values, however, this thesis
uses the minimum path from all possible LP input lines and all source behaviors in the LP
to each LP output line in calculating the corresponding output line’s delay value. For a
random partition, this method generally results in no net improvement in the lookahead of
the circuit. However, for more sophisticated partitioning algorithms, .arcs files with
larger LP delay values can be obtained. When (t_safe + t_delay) < t_next, a larger delay value
will result in a null message with a higher time stamp being sent. In turn, this will result
in fewer null messages being sent over that output line, and may allow the safe time of
the receiving LP to advance at a faster rate. More discussion of the effect of increasing
the lookahead of the circuit is presented in Chapter 5.

3.3.2.4 Null Messages. In VSIM, the conservative Chandy-Misra
synchronization protocol adds communications overhead in the form of null
messages, which transmit no useful information (as opposed to real messages, which
transmit actual signal value information). As discussed in section 3.2.3.3, null
messages are sent according to the given set of rules in order to avoid deadlock by
updating the safe times of the receiving LP.

Analysis of the rules for sending null messages shows that the number of null
messages transmitted is primarily dependent upon the number of arcs in the LP
inter-connectivity graph (i.e., the number of output lines as specified in the associated
.arcs file). This is also equivalent to the number of non-zero entries in the communications
weight matrix. Other factors which also have an effect on the number of null messages
sent are the minimum LP delay values specified in the .arcs file, and the number of
inter-behavior arcs which cross LP boundaries.

The simulation results presented in Chapter 5 show that the ratio of null messages to
real messages in a simulation grows with the number of processors, with null messages
dominating the communications costs for partitions with more than 2-4 LPs for the
circuits used in this thesis. For example, for the Wallace tree simulation with a random
partition, the null message to real message ratio goes from approximately 1:4 with 2 LPs
to approximately 5:1 with 8 LPs. The equations in sections 3.3.2.1 and 3.3.2.2 for H_c and
H_d do not directly account for the transmission of the null messages required to maintain
simulation synchronization. Therefore, an additional cost factor is needed to take into
account this sizable overhead.

Letting L_arcs be the amount of lookahead in the .arcs file, and O_arcs be the number
of LP output lines specified in the .arcs file, the null message communications factor can
be defined as:

$$H_n = L_{arcs} \, O_{arcs}$$

The dependence of the number of null messages on the number of inter-behavior
dependencies that cross LP boundaries is included in H_c, and is not included again in H_n.

Given N LPs, each LP can be connected to at most N-1 LPs. Thus, the limit on the
value of O_arcs is given by:

$$O_{arcs} \le N(N-1)$$

The value of L_arcs is normalized to 1.0 for the case when all LP delays in the .arcs
file are equal to the smallest non-zero behavior delay in the system. Letting the smallest
non-zero behavior delay in the system be called the normal lookahead, the value of L_arcs
can be calculated as:

$$L_{arcs} = \frac{\text{normal lookahead}}{\text{actual average lookahead}}$$

If the actual average lookahead is made to be smaller than the normal value, L_arcs will be
greater than 1.0, thus compounding the impact of O_arcs on the total communications
costs. On the other hand, if the actual average lookahead is larger than the normal value,
L_arcs will be smaller than 1.0 and the effect of O_arcs will be abated.

It should be noted that because the LP delay is not always used in computing the
timestamp of a null message (i.e., when t_next < (t_safe + t_delay)), the value of L_arcs will only
provide an estimate of the effect of increased lookahead on the null message overhead.
For this reason, a better estimate of the impact of increased lookahead can be calculated
by making a reasonable assumption regarding the percentage of time that the delay value
determines the timestamp of a null message. The assumption used in this thesis is 50%,
resulting in the lookahead increase being cut in half. It should be noted that the uneven
distribution of signal activity in the circuit will affect the accuracy of this assumption.
However, Chapter 5 includes data on the effect of increased lookahead which shows this
assumption to provide good estimates under most circumstances. The equation for L_arcs is
adjusted as follows:

$$L_{arcs} = \frac{\mathit{norm\_lookahead}}{\left(\dfrac{\mathit{norm\_lookahead} - \mathit{avg\_lookahead}}{2}\right) + \mathit{avg\_lookahead}}$$
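A small numeric sketch may help show the effect of the 50% assumption. It is written under the assumption that the null message factor is the product of the lookahead factor and the number of LP output lines; the function names and the `delay_used_fraction` parameter (which generalizes the 50% figure) are hypothetical and introduced only for this example.

```python
def lookahead_factor(norm_lookahead, avg_lookahead, delay_used_fraction=0.5):
    """Lookahead factor when only part of the lookahead increase is credited.

    With delay_used_fraction=0.5 the denominator is the midpoint of the normal
    and average lookahead, i.e. the lookahead increase is cut in half.
    With delay_used_fraction=1.0 this is the unadjusted ratio norm / avg.
    """
    effective = norm_lookahead + delay_used_fraction * (avg_lookahead - norm_lookahead)
    return norm_lookahead / effective


def null_message_factor(num_output_lines, norm_lookahead, avg_lookahead):
    """Null message factor as lookahead factor times the number of LP output lines."""
    return lookahead_factor(norm_lookahead, avg_lookahead) * num_output_lines


# Hypothetical values: average lookahead three times the normal lookahead, 12 output lines.
print(lookahead_factor(1.0, 3.0, delay_used_fraction=1.0))  # unadjusted ratio, 1/3
print(lookahead_factor(1.0, 3.0))                            # adjusted ratio
print(null_message_factor(12, 1.0, 3.0))
```

Under these assumed numbers, tripling the lookahead is credited with halving the null message factor rather than cutting it to a third.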

3.3.3 Balancing Load Imbalance and Communications Costs. Intuitively, there is
a natural conflict between the partitioning goals of minimizing the inter-process
communications costs and assigning each LP an equal share of the computation load. For
example, by assigning all 10 behaviors in Figure 18 to LP0, the inter-process
communications costs can be eliminated completely. However, the problem is reduced to
a sequential simulation with LP0 carrying the entire computation load while LP1 sits idle.
On the other hand, with the partition as shown in Figure 18, the simulation is closer to
being balanced in terms of computation load at the price of adding inter-process
communications costs.

The primary goal of any partitioning algorithm is to strike a balance between these
two conflicting goals that results in good simulation performance. The exact nature of
this balance, however, is dependent upon the relative performance of the CPU and
communications subsystem of the hardware platform being used.

3.3.4  Measuring  the  Cost  of  a  Partition.  Measurement  of  the  simulation 
performance  resulting  from  a  given  partition  is  the  ultimate  method  of  determining  the 
quality of a partition. However, many partitioning algorithms, including the one
implemented in this research, require an interim assessment of the quality of the partition
at various points in the algorithm. To achieve this interim assessment, an objective cost
function  is  used  similar  to  the  one  used  for  the  mean  field  annealing  algorithm  described 
in  section  2.4.11. 

3.3.4.1 Objective Cost Function. The objective cost function is composed of
the factors for load imbalance and communications costs discussed in sections 3.3.1 and
3.3.2. Specifically, the objective cost function H is defined as:

H  =  +  aH, 

where α and β are constant coefficients which control the relative influence of the
communications and load balancing portions of the equation.

3.3.4.2 Relationship to Simulation Performance. One of the objectives of this
research was to quantify the relationship between the quality of a partition and the
resulting performance of the simulation. The performance of the simulation is measured
in terms of speedup, which is defined as the ratio of the execution time on a single
processor to the execution time on P processors (26:65):

$$\text{Speedup} = S_p = \frac{t_{\text{one processor}}}{t_{P\ \text{processors}}}$$

Ideally, the time to execute on P processors will be 1/Pth the time to execute on a
single processor, giving a speedup of P:

$$S_p = \frac{t_{\text{one processor}}}{t_{\text{one processor}} / P} = P$$

In general, however, the simulation overhead will prevent the achievement of a speedup
of P on P processors. As discussed previously, the simulation overhead is directly related
to the quality of the simulation partition as measured by the objective cost function H. It
is possible to estimate the expected simulation speedup by relating the cost function H to
the speedup S_p as follows:

$$S_p = \frac{P}{1 + \gamma H}$$

where γ is a constant coefficient. Note that in a perfect partition (i.e., no inter-processor
communications and equally balanced loads), the cost function will equal 0 and the
speedup will be equal to the number of processors. In addition, note that if (1 + γH)
increases at a faster rate than P, then the estimated speedup will decrease as the number of
processors is increased.
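The two observations above can be checked numerically with a minimal sketch of the speedup estimate P / (1 + γH). The function name and all numeric values of H and γ below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def estimated_speedup(num_processors, cost, gamma):
    """Estimated speedup P / (1 + gamma * H) for objective cost H."""
    return num_processors / (1.0 + gamma * cost)


# A perfect partition (H = 0) gives a speedup equal to the processor count.
print(estimated_speedup(8, 0.0, gamma=2.0))

# If (1 + gamma * H) grows faster than P, adding processors lowers the estimate.
print(estimated_speedup(4, 0.5, gamma=2.0))   # 4 / 2
print(estimated_speedup(8, 2.0, gamma=2.0))   # 8 / 5
```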
3.4 Partitioning Approach

The primary partitioning algorithm implemented in this thesis, referred to as
AB-Annealing, draws upon several properties of the partitioning strategies presented in
section 2.4. AB-Annealing is a multi-phased partitioning strategy with the following
steps:

• The graph is partitioned into strong components.

• Treating the strong components as indivisible blocks, the graph is divided into the
required number of LPs using a deterministic graph-traversal algorithm.

• Given the initial partition from the previous step, those behaviors that lie on the
boundary between two LPs are considered for reassignment to a different LP on a
priority basis based upon the potential reduction in the objective cost function.

The phased approach to the mapping problem used here is similar to the phased
approaches used in the Two-Dimensional Algorithm of section 2.4.5 (22) and Algorithm
H of section 2.4.7 (16). The initial partition step uses a simple graph-traversal algorithm
which has many similarities with the Depth-First Breadth-Next algorithm of section 2.4.8
(17). The final step implements a “border-annealing” algorithm in an attempt to
iteratively improve the partition by making behavior reassignments that result in a
decrease in the objective cost function. The objective cost function and iterative nature of
this phase make it similar to a deterministic version of the Mean Field Annealing
algorithm of section 2.4.11 (5). However, the fact that only those behaviors that lie on the
border between two LPs are considered for reassignment also makes this phase similar to
the Two-Dimensional Algorithm of section 2.4.5 (22).

3.4.1 Strong Components. The first step of the AB-Annealing process involves
finding the strongly connected components of the problem graph. Each strong component
in the problem graph corresponds to a complete feedback loop in the VHDL circuit. This
step is included in order to isolate each feedback loop in a single LP during the initial
partition. An example strong component common in digital circuits is shown in Figure
22. The problem graph in the example represents a simple latch with a two-behavior
feedback loop.

Figure 22. Strong Component Example - Simple Latch Feedback Loop
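For illustration, strong components can be found with Tarjan's algorithm, sketched below. The text does not say which strong-component algorithm GP-Tool uses, so this choice is an assumption, and the four-behavior graph (an input behavior, a two-behavior feedback loop, and an output behavior) is a hypothetical stand-in for the latch example.

```python
def strong_components(successors):
    """Tarjan's algorithm: return the strongly connected components of a digraph."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in successors.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of a strong component
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            components.append(sorted(component))

    for v in successors:
        if v not in index:
            visit(v)
    return components


# Hypothetical latch: behaviors 1 and 2 drive each other, forming a feedback loop.
latch = {0: [1], 1: [2], 2: [1, 3], 3: []}
print(sorted(strong_components(latch)))
```

The recursive form is kept for brevity; for circuits with thousands of behaviors an iterative variant would avoid Python's recursion limit.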


3.4.2 Initial Partition. In the Mean Field Annealing algorithm, the initial partition
is determined with a random function. In the AB-Annealing algorithm, it was desired to
use the available knowledge about the structure of the circuit in an effort to make a good
first-cut partition. It was anticipated that this approach, in addition to being deterministic,
would reduce the amount of work necessary during the annealing phase as well as lead to
a better final partition. Two variations on the DFBN algorithm of section 2.4.8 have been
implemented to serve as the initial partitioning routines for the AB-Annealing algorithm.

In the first algorithm, referred to as Simple Depth-First (SDF) partitioning, the
problem graph is traversed in a depth-first manner. When a behavior is visited for the first
time, it is added to the current partition and marked so that it will not be added to any
other partitions. Before beginning a partition, its size is pre-determined using the number
of unmarked behaviors divided by the number of unfilled partitions. If a newly visited
behavior is part of a strong component, the entire strong component is added to the
current partition. When the current partition is full, a new partition is started.

In the second algorithm, referred to as Single Breadth-First (SBF) partitioning, the
problem graph is traversed in a breadth-first manner, but the partitions are built in the
same way as with the SDF partitioning.
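The following is a rough sketch of both traversal-based initial partitions, not the GP-Tool implementation. Several details are assumptions made for the example: partition sizes are rounded down, the traversal continues across partition boundaries rather than restarting, the last partition absorbs any remainder, and strong components are supplied as a hypothetical `component_of` mapping from behavior to component label.

```python
from collections import deque


def traversal_order(successors, depth_first=True):
    """Visit every behavior once, depth-first (stack) or breadth-first (queue)."""
    seen, order = set(), []
    for root in successors:
        if root in seen:
            continue
        frontier = deque([root])
        while frontier:
            v = frontier.pop() if depth_first else frontier.popleft()
            if v in seen:
                continue
            seen.add(v)
            order.append(v)
            nxt = successors.get(v, ())
            # Reversed for the stack so successors are popped in their listed order.
            frontier.extend(reversed(nxt) if depth_first else nxt)
    return order


def initial_partition(successors, num_lps, component_of=None, depth_first=True):
    """Fill LPs in traversal order, keeping each strong component in one LP."""
    component_of = component_of or {}
    members = {}
    for v, label in component_of.items():
        members.setdefault(label, []).append(v)
    order = traversal_order(successors, depth_first)
    lp_of, lp, filled = {}, 0, 0
    size = len(order) // num_lps                 # unmarked behaviors / unfilled partitions
    for v in order:
        if v in lp_of:
            continue                             # already placed with its strong component
        group = members.get(component_of.get(v), [v])
        for u in group:
            lp_of[u] = lp
        filled += len(group)
        if filled >= size and lp < num_lps - 1:  # current partition is full
            lp, filled = lp + 1, 0
            size = (len(order) - len(lp_of)) // (num_lps - lp)
    return lp_of


# Hypothetical six-behavior tree, split into two LPs both ways.
tree = {0: [1, 2], 1: [3], 2: [4], 3: [5], 4: [], 5: []}
print(initial_partition(tree, 2, depth_first=True))
print(initial_partition(tree, 2, depth_first=False))
```

On this small tree the depth-first variant groups a root-to-leaf chain into the first LP, while the breadth-first variant groups the levels nearest the root.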

For comparison purposes, the random partitioning algorithm can also be used to
generate the initial partition for the AB-Annealing algorithm. In the random case,
however, the first step of finding the strong components has no meaning and is omitted.

3.4.3 Border-Annealing. The third and final phase of the AB-Annealing
algorithm consists of an iterative improvement of the initial partition through selective
reassignment of certain behaviors that lie on the border of the partition. This phase is
referred to as “border-annealing,” and involves the following steps:

• Calculate the priority of each behavior based upon the following formula:

Priority = Max_External_Arcs - Local_Arcs

where Local_Arcs represents the number of input and output arcs of the given
behavior that are to or from behaviors in the same partition, and
Max_External_Arcs is the maximum number of input and output arcs of the given
behavior that are to or from behaviors in any single partition other than the given
partition.

• Place each behavior with a priority ≥ 0 in an annealing queue in decreasing order
of priority. By definition, this eliminates from consideration all behaviors that are
not on the border of a partition (such behaviors will have a priority < 0).

• Remove each behavior from the queue in priority order and evaluate the effect of
all possible moves on the objective cost function.

    • Based upon the data produced in the previous step, select the best move that
    will not cause the load delta factor to become larger than the maximum load
    imbalance tolerance (user defined input parameter). It is possible that no move
    may be selected.

    • If a move is selected, carry out the move and update the data structures used in
    the calculation of the objective cost function.

• Repeat the above steps until the maximum number of iterations (user defined
input parameter) has been reached, or until no more improvement can be made.

3.4.3.1 Selecting Moves for Consideration. In the AB-Annealing algorithm,
behaviors are selected for potential LP reassignment based upon their priority. Using the
behavior prioritization scheme discussed above, behaviors that have no input or output
arcs which cross an LP boundary will not be considered for moving. In addition,
behaviors which have more input and output arcs that stay within their own LP than go to
any single external LP will also be eliminated from consideration.

Figure 23 shows several examples of behavior priorities. In Figure 23.a, behavior 3 in
LP0 has two arcs connected to behaviors in LP1, but only one arc connected to a behavior
in its own LP. Thus, behavior 3 has a priority of +1, and will be queued up for move
consideration. In Figure 23.b, the two external arcs of behavior 3 are connected to two
different LPs. The maximum number of external arcs to a single LP is 1, and the priority
of behavior 3 is 0. In this situation, a move can be made to improve load balancing, or
some other factor, without any net change in the number of inter-LP arcs. In Figure 23.c,
behavior 3 has more arcs connected to behaviors in its own LP than any other LP. Thus,
its priority is negative, and it will not be placed in the annealing queue for move
consideration.

Figure 24 shows the actual SDF initial partition for the edge-triggered D flip-flop
circuit. The behaviors that lie on the boundaries of the partitions are highlighted. The
priorities for each behavior are listed in Table 1. Note that in this example, only behavior
8 has a nonnegative priority. Thus, only behavior 8 will be considered for LP
reassignment.

Figure 24. SDF Initial Partition for Edge-Triggered D Flip-Flop

Table 1. Behavior Priorities for the Partition of Figure 24

Behavior    Max External Arcs    Local Arcs    Priority
   0                0                 3           -3
   1                1                 4           -3
   2                1                 4           -3
   3                1                 3           -2
   4                1                 3           -2
   5                1                 3           -2
   6                0                 1           -1
   7                0                 1           -1
   8                1                 0           +1
   9                0                 2           -2
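The priority formula can be exercised on the edge-triggered D flip-flop as a sketch. Two things here are assumptions rather than facts from the text: the arc list is read from the GP-Tool input listing for this circuit (Figure 26) on the interpretation that the trailing numbers on each line are the behaviors that line's behavior drives, and the two-LP assignment is a hypothetical one chosen to be consistent with the statement that only behavior 8 has a nonnegative priority. The function names are also hypothetical.

```python
from collections import Counter

# (source, destination) arcs, as interpreted from the flip-flop input listing.
ARCS = [(9, 1), (9, 2), (8, 3), (5, 4), (5, 7), (4, 5), (4, 6), (3, 0),
        (3, 2), (2, 3), (2, 5), (1, 0), (1, 2), (1, 4), (0, 1)]

# Assumed behavior-to-LP assignment (hypothetical, five behaviors per LP).
LP_OF = {0: 0, 1: 0, 2: 0, 3: 0, 9: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1}


def priorities(arcs, lp_of):
    """Priority = Max_External_Arcs - Local_Arcs for every behavior."""
    local = Counter()
    external = {b: Counter() for b in lp_of}    # per-behavior arc counts by foreign LP
    for src, dst in arcs:
        for me, other in ((src, dst), (dst, src)):
            if lp_of[me] == lp_of[other]:
                local[me] += 1
            else:
                external[me][lp_of[other]] += 1
    return {b: max(external[b].values(), default=0) - local[b] for b in lp_of}


def annealing_queue(arcs, lp_of):
    """Behaviors with priority >= 0, in decreasing order of priority."""
    prio = priorities(arcs, lp_of)
    return sorted((b for b in prio if prio[b] >= 0), key=lambda b: -prio[b])


print(priorities(ARCS, LP_OF))
print(annealing_queue(ARCS, LP_OF))   # only behavior 8 qualifies under these assumptions
```

Counting external arcs per foreign LP, rather than in total, is what distinguishes the two-LP and three-LP cases described for Figure 23.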

Because a selected move may affect the priority of other behaviors in the annealing
queue, it is not possible to select more than one move at a time. A behavior can only be
fully evaluated regarding all prospective moves when it reaches the top of the annealing
queue. The idea behind using the annealing queue is to narrow down the set of behaviors
that must be evaluated to only those that have a good chance of qualifying for a move.
Under this approach, it is possible that a selected move will cause the priority of a
non-queued behavior to become nonnegative. In this situation, the affected behavior would be
queued during the next iteration. The alternative to this approach would be to iterate
through all of the behaviors after each move, selecting the single best candidate behavior.
This alternative has two disadvantages. First, it is computationally more expensive,
requiring O(NP^2) worst case per potential move, vs. O(P^2) (note that N is the number of
behaviors and P is the number of logical processes). Second, a complex scheme of
tracking which behaviors have been considered for reassignment must be implemented
along with a prioritization mechanism to prevent the starvation of low priority behaviors.

3.4.3.2 Solution Convergence. The iterative border-annealing process
described above continues until one of three things occurs:

• The algorithm converges to a solution in which no more progress can be made.
This is indicated by an entire iteration in which no moves are selected.

• The maximum number of iterations is reached.

• A specified number of consecutive iterations are processed with no net
improvement in the objective cost function.

To allow for the evaluation of slight variations in the annealing process, three values
are taken as modifiable input parameters to the annealing algorithm:

• Num_Iterations - defines the maximum number of annealing iterations to
perform before terminating the process.

• Max_Worthless_Iterations - defines the number of consecutive iterations
with no net improvement in the objective cost function which can be processed
before the annealing process is terminated.

• Load_Imbal_Tol - defines the maximum value of the load delta factor that is
acceptable. Moves which cause the load delta factor to be larger than the
Load_Imbal_Tol will not be made, even if they lower the overall inter-process
communications costs.
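The three stopping rules can be sketched as a small driver loop. This is only an outline under stated assumptions: `run_iteration` and `cost` are hypothetical callbacks standing in for one pass over the annealing queue and for the objective cost function, "no net improvement" is interpreted as failing to beat the best cost seen so far, and the load imbalance tolerance is assumed to be enforced inside move selection rather than here.

```python
def border_anneal(run_iteration, cost, num_iterations, max_worthless_iterations):
    """Run annealing passes until one of the three stopping rules fires.

    run_iteration() performs one pass and returns the number of moves made;
    cost() returns the current value of the objective cost function.
    """
    best = cost()
    worthless = 0
    for iteration in range(1, num_iterations + 1):
        moves = run_iteration()
        if moves == 0:                       # converged: a whole pass selected no move
            return iteration, "converged"
        current = cost()
        if current < best:
            best, worthless = current, 0
        else:
            worthless += 1
            if worthless >= max_worthless_iterations:
                return iteration, "no net improvement"
    return num_iterations, "iteration limit"


# Hypothetical run: the cost improves twice and then stalls.
costs = iter([10.0, 8.0, 7.0, 7.0, 7.0, 7.0])
print(border_anneal(lambda: 1, lambda: next(costs),
                    num_iterations=10, max_worthless_iterations=2))
```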

3.4.4  Topological  Variation.  As  discussed  in  section  2.2.1,  topological  variation 
arises  when  the  inter-dependency  structure  of  the  parallel  application  differs  from  the 
inter-connectivity  structure  of  the  parallel  system  (2).  In  the  case  of  VSIM,  grouping  the 
circuit  behaviors  into  logical  processes  (LPs)  does  not  eliminate  the  problem  of 
topological  variation,  as  the  inter-connections  between  the  LPs  may  still  differ  from  the 
inter-connectivity  structure  of  the  hypercube. 

Figure  25  shows  the  interconnection  structure  of  an  8  node  hypercube.  In  a  hypercube 
architecture,  each  of  the  P  processors  has  log2  P  nearest  neighbor  processors  with  which 
it  shares  a  direct  communications  link.  In  order  for  a  processor  to  communicate  with  a 
processor  that  it  is  not  directly  connected  to,  the  communications  must  be  routed  via  one 
or  more  intermediate  processors.  In  theory,  the  longer  the  communications  path  and  the 
greater  the  number  of  intermediate  routing  processors,  the  higher  the  communications 
costs.  As  a  result,  it  is  desirable  to  assign  the  LPs  to  physical  processors  in  such  a  way 
that  the  number  of  inter-LP  connections  which  correspond  with  single-hop  (i.e.,  direct) 
physical  connections  is  maximized. 

To address these topological concerns, two additional features have been added to the
partitioning algorithm as menu selectable options. The first deals with the order in which
the LPs are built during the SDF and SBF initial partitions. This option is based upon the
assumption that LP0 will be assigned to processor 0, LP1 will be assigned to processor 1,
..., and LP(P-1) will be assigned to processor P-1. When performing the initial SDF or SBF
partitionings, the default ordering for assigning behaviors to LPs is to begin with LP0 and
proceed in a straight numerical ordering to LP(P-1) (e.g., 0-1-2-3-4-5-6-7).

Figure 25. Topological Layout on Hypercube Connectivity Graph

The potential problem with this approach lies in the fact that with the SDF and SBF
partitionings, there is a tendency for LPn to have a large number of connections to
LPn+1 (here, the term connection corresponds to an inter-behavior arc that crosses LP
boundaries). However, in a hypercube architecture, direct connections are determined by the
binary representation of the processor numbers, and processors n and n+1 may not be
directly connected. Specifically, only those processors whose binary representations
differ in a single bit position are directly connected. If processor n and processor n+1 are
not directly connected, then this situation will result in an increase in the amount of
multi-hop communications. For example, if LP1 has a connection to LP2 (assigned to
processors 1 and 2 respectively), each message from LP1 to LP2 will traverse two
physical communication links because processors 1 and 2 are not directly connected in a
hypercube architecture. This is shown graphically in Figure 25.a.

In an attempt to minimize the amount of multi-hop communications, an alternative LP
assignment ordering is used based upon a path through the hypercube in which each
subsequent processor has a binary representation that differs from the previous one by a
single bit position (e.g., 0-1-3-2-6-7-5-4). This is shown graphically in Figure 25.b.
However, such a path only exists if the number of processors is a power of 2. In those
circumstances where the number of processors is not a power of 2, the “extra” processors
will be assigned in a straight numerical order. For example, the assignment ordering for
12 processors will be 0-1-3-2-6-7-5-4-8-9-10-11. Note that in a random partition, this
option has no meaning and is ignored.
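The example orderings above coincide with the reflected binary Gray code, so one way to generate such an ordering is sketched below. The use of the Gray code formula and the function name are assumptions for illustration; the text does not state how GP-Tool derives the path.

```python
def lp_assignment_order(num_processors):
    """Single-bit-change ordering for the largest power of 2, then the extras in numerical order."""
    gray_len = 1
    while gray_len * 2 <= num_processors:
        gray_len *= 2                                 # largest power of 2 <= num_processors
    gray = [i ^ (i >> 1) for i in range(gray_len)]    # reflected binary Gray code
    return gray + list(range(gray_len, num_processors))


print(lp_assignment_order(8))    # matches the 8-processor example ordering
print(lp_assignment_order(12))   # matches the 12-processor example ordering
```

Consecutive entries within the Gray-coded prefix differ in exactly one bit, so LPs built one after another land on directly connected processors.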

The second feature which has been added as an option to address topological concerns
is  the  ability  to  weight  inter-LP  arcs  based  upon  the  number  of  hops  in  the  corresponding 
physical  communications  link.  When  activated,  this  option  will  have  no  effect  upon  the 
initial  SDF  or  SBF  partitions,  but  will  influence  behavior  reassignment  decisions  in  the 
border-annealing  process.  The  net  effect  will  be  a  tendency  to  reduce  the  number  of 
multi-hop  communications  at  the  potential  cost  of  an  increase  in  the  amount  of  single-hop 
communications. 
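Since two hypercube processors are directly connected only when their numbers differ in one bit, the hop count between two processors is the number of differing bits. The sketch below assumes LP i is assigned to processor i and that an arc's weight scales linearly with its hop count; the linear scaling and the function names are assumptions, as the text does not give the exact weighting used.

```python
def hops(a, b):
    """Number of physical links between hypercube processors a and b."""
    return bin(a ^ b).count("1")          # count of differing bit positions


def hop_weighted_matrix(c):
    """Scale each inter-LP weight by the hop count of its physical route."""
    return [[w * hops(i, j) for j, w in enumerate(row)] for i, row in enumerate(c)]


print(hops(1, 2))                          # processors 1 and 2 are two hops apart
weights = [[0, 5, 0, 0], [0, 0, 3, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(hop_weighted_matrix(weights))
```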

3.5  Summary 

This chapter discussed the partitioning objectives of achieving a balanced
computation load while minimizing the inter-process communications overhead.
Equations for modeling the quality of a partition as it relates to these objectives were
proposed. The model proposed for measuring the inter-process communications overhead
takes into account the additional message traffic caused by the conservative null-message
protocol. The communications overhead cost is combined with a factor that accounts for
load imbalance to create an objective cost function for the partition. A multi-phased
partitioning algorithm is then proposed with the objective of minimizing the objective
cost function.

The next chapter presents the detailed implementation of the proposed partitioning
algorithm, and discusses the primary test cases used in this thesis.

IV. Implementation


4.1  Overview 

This  chapter  presents  a  high-level  discussion  of  the  implementation  of  the  VHDL 
Graph-Partitioning  Tool  (GP-Tool).  It  also  discusses  the  primary  VHDL  circuits  used  as 
test  cases  in  order  to  validate  the  partitioning  strategies  implemented  as  part  of  this  thesis. 
A  complete  GP-Tool  user’s  guide  and  a  more  detailed  discussion  of  the  GP-Tool 
implementation  can  be  found  in  Appendix  C. 

4.2 VSIM Graph-Partitioning Tool (GP-Tool)

The VHDL Graph-Partitioning Tool (GP-Tool) implemented in this thesis is an
extension of the partitioning utility VHDL Graph Searching Program implemented
during previous AFIT research. That utility was written in order to provide a random
distribution of the circuit behaviors among a desired number of LPs. The following
sections present an overview of the functionality added to GP-Tool as part of this thesis.
More information can be found in Appendix C.

4.2.1  Implementation Environment.  GP-Tool is implemented in the Ada
programming language using the Sun Ada Compiler, version 1.1. The algorithms
implemented in GP-Tool are all sequential, with Sun workstations as the target platform.
The original source code was written in a procedural fashion (i.e. not object-oriented),
and made heavy use of Ada generics to provide the underlying data structures and data
structure manipulation routines. The current version of GP-Tool is also written in a
procedural fashion. In addition to allowing maximum reusability of the original source

The original utility was written by AFIT instructor Maj Eric R. Christensen, USA.
9  ET_DFF_TEST_BENCH (STRUCTURAL)  0  1  2
8  ET_DFF_TEST_BENCH (STRUCTURAL)  0  3
7  ET_DFF (STRUCTURAL)  0
6  ET_DFF (STRUCTURAL)  0
5  NAND_GATE (SIMPLE)  3000000  4  7
4  NAND_GATE (SIMPLE)  3000000  5  6
3  NAND_GATE (SIMPLE)  3000000  0  2
2  THREE_INPUT_NAND_GATE (SIMPLE)  3000000  3  5
1  NAND_GATE (SIMPLE)  3000000  0  2  4
0  NAND_GATE (SIMPLE)  3000000  1


Figure 26.  GP-Tool Input File for Edge-Triggered D Flip-Flop


code, procedural programming is the preferred methodology for routines that are
computationally intensive, such as the AB border-annealing algorithm (24:24). Appendix
C details the relationship between the original (unmodified) Ada packages, the modified
Ada packages, and the new Ada packages which comprise the current version of GP-Tool.


4.2.2  Input and Output Files.  At application startup, GP-Tool will prompt the
user for the name of the desired input file. The input file can be created manually for
trivially small circuits. For larger circuits, the VSIM utility vmap can be used to create the
correct file (see Appendix B for more information on vmap). This file lists the behaviors in
the circuit and describes the behavior inter-dependency relationships. Each line in the
input file must be of the following format:

behav_id  behav_name  behav_delay  [optional list of dependencies]
where  all  values  are  nonnegative  integers  except  behav_name  which  is  a  string  of  no 
more  than  80  characters.  Figure  26  shows  an  example  input  file  for  the  edge-triggered  D 
flip-flop  of  Figure  18.  Note  that  behaviors  6  through  9  have  zero  delays. 
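
The line format above can be illustrated with a short parsing sketch. It is written in Python rather than GP-Tool's Ada and is not taken from the GP-Tool source. The names Behavior and parse_behavior_line are hypothetical, and the rule used to separate the name from the numeric fields (the trailing all-digit tokens are taken as the delay followed by the dependency list) is an assumption made so that names containing a blank, as in Figure 26, are still accepted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Behavior:
    """One line of the input file (hypothetical record, not GP-Tool's)."""
    behav_id: int
    name: str
    delay: int
    deps: List[int] = field(default_factory=list)


def parse_behavior_line(line: str) -> Behavior:
    """Parse 'behav_id behav_name behav_delay [dependencies]'."""
    tokens = line.split()
    if len(tokens) < 3:
        raise ValueError(f"too few fields: {line!r}")
    behav_id = int(tokens[0])
    # Assumption: every trailing all-digit token is numeric data
    # (delay first, then dependencies); the rest is the name.
    split = len(tokens)
    while split > 1 and tokens[split - 1].isdigit():
        split -= 1
    name = " ".join(tokens[1:split])
    numbers = [int(t) for t in tokens[split:]]
    if not name or not numbers:
        raise ValueError(f"missing name or delay: {line!r}")
    if len(name) > 80:
        raise ValueError("behav_name is limited to 80 characters")
    return Behavior(behav_id, name, numbers[0], numbers[1:])


if __name__ == "__main__":
    print(parse_behavior_line("5 NAND_GATE (SIMPLE) 3000000 4 7"))
```

A line with no dependency list, such as behavior 7 in Figure 26, yields an empty deps list.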

There are numerous output files that can be produced by GP-Tool. Of these files, the
two most important are the behavior-to-LP mapping file (ipx.map) and the logical
process (LP) dependency file (ipx.arcs) which are required for the parallel execution of
VSIM. The first file maps each behavior to an “owning” LP, while the latter file defines
the inter-dependency relationships of the LPs in the system.

In the original version of GP-Tool, the ipx.map file describing the random partition was
produced directly by GP-Tool, but the ipx.arcs file was not. Instead, an intermediate
file was produced which, along with the ipx.map file, was used as the input to a separate
utility application (build_arc) which produced the appropriate ipx.arcs file. However,
the build_arc utility was unable to handle the large circuit description files which
comprised the primary test cases used in this thesis. To circumvent this problem, the
current version of GP-Tool directly produces the ipx.arcs file corresponding to each
ipx.map file. The specific format specifications for the ipx.map and ipx.arcs files are
discussed in section B.4.1 of Appendix B.

The other output file which is of primary interest is the partition statistics file which
provides a large amount of information about the quality of the resulting partition.
Among the information reported is the number of inter-component arcs, the
communications cost factor (H_c), the communications distribution factor (H_d), the
number of LP output lines (O_arcs), the load delta factor (H_b), the communications weight
matrix, and a list of the behaviors assigned to each LP. This information is provided to
facilitate the comparison of the quality of different partitions. This file is produced
automatically for each partition generated, but is not required by VSIM.

An  example  partition  statistics  file  is  shown  in  Figure  27.  This  figure  shows  a  Simple 
Depth-First  (SDF)  partition  for  the  Wallace-Tree  multiplier  with  4  LPs.  The  top  portion 
of  the  file  gives  the  general  information  about  the  input  problem-graph.  Specifically,  it 
lists the name of the input file and the number of vertices and arcs in the problem graph.
The  middle  section  lists  detailed  information  about  the  partition,  beginning  with  the  name 
of  the  partitioning  algorithm  used.  The  other  values  presented  are  as  follows: 




•  Number of components - Gives the number of LPs in the partition.

•  Inter-component arcs - Gives the total number of inter-behavior arcs which
cross LP boundaries.

•  Wght_Inter_LP_Arcs - Represents the inter-component arcs with each arc
multiplied by the hop-weight of the corresponding physical communications link.
This is equivalent to the sum of the entries in the communications weight matrix.
If all arcs are evenly weighted with a value of 1.0, this figure will be the same as
the previous item.

•  Avg_Wght_Arcs - Represents the average output arc weight for each LP
(Wght_Inter_LP_Arcs divided by the number of LPs). This is equivalent to the
average of the row sums in the communications weight matrix.

•  Stddev_Wght_Out_Arcs - Represents the standard deviation of the output arc
weights for each LP. This is equivalent to the standard deviation of the row sums
of the communications weight matrix.

•  Maxdev_Wght_Out_Arcs - Represents the maximum positive deviation of the
output arc weights for each LP. This is equivalent to the maximum positive
deviation of the row sums of the communications weight matrix.

•  Stddev_Wght_In_Arcs - Represents the standard deviation of the input arc
weights for each LP. This is equivalent to the standard deviation of the column
sums of the communications weight matrix.

•  Maxdev_Wght_In_Arcs - Represents the maximum positive deviation of the
input arc weights for each LP. This is equivalent to the maximum positive
deviation of the column sums of the communications weight matrix.
GRAPH INFORMATION  -  Wallace.vmap

The number of vertices in this graph is :  1050
The number of arcs in this graph is     :  1770

PARTITION INFORMATION  -  Simple Depth-First (SDF) Partitioning

Number of components    :  4
Inter-component arcs    :  312
Wght_Inter_LP_Arcs      :  312.0
Avg_Wght_Arcs           :  78.0
Stddev_Wght_Out_Arcs    :  66.5
Maxdev_Wght_Out_Arcs    :  89.0
Stddev_Wght_In_Arcs     :  46.6
Maxdev_Wght_In_Arcs     :  49.0
Comm_Cost_Factor        :  17.63 %
Comm_Dist_Factor        :  114.10 %
LP_Output_Lines         :  11
Lookahead_Factor        :  0.667
Avg_Comp_Load           :  262.5
Stddev_Comp_Load        :  0.6
Maxdev_Comp_Load        :  0.5
Load_Delta_Factor       :  0.19 %

The LP loads are :

   263  263  262  262

The communications weight matrix is :

     0.0     2.0     0.0     5.0       7.0
    56.0     0.0     3.0     2.0      61.0
    21.0    48.0     0.0     8.0      77.0
    50.0    40.0    77.0     0.0     167.0
   127.0    90.0    80.0    15.0  => 312.0

The total partition cost is :  3.65
The predicted speedup is    :  3.01

Speedup prediction parameters :
   alpha :  100.0000000000
   beta  :  1.0000000000
   gamma :  0.0900000000


Figure 27.  Wallace-Tree SDF Partition Statistics File for 4 LPs
•  Comm_Cost_Factor - This factor is the H_c discussed in section 3.3.2.1. It
represents Wght_Inter_LP_Arcs divided by the total number of inter-behavior
arcs in the input problem-graph.

•  Comm_Dist_Factor - This factor is the H_d discussed in section 3.3.2.2. It
represents Maxdev_Wght_Out_Arcs (the difference between the largest LP output
arc weight and Avg_Wght_Arcs) divided by Avg_Wght_Arcs. In Figure 27, for
example, 89.0 / 78.0 = 114.10%.

•  LP_Output_Lines - This factor is the O_arcs discussed in section 3.3.2.4, and
represents the number of arcs in the LP connectivity graph. It is equivalent to the
number of non-zero entries in the communications weight matrix and the number
of output lines specified in the ipx.arcs file.

•  Lookahead_Factor - This factor is the L_arcs discussed in section 3.3.2.4, and
provides a measure of the amount of lookahead in the ipx.arcs file (the smaller
the value, the larger the average lookahead).

•  Avg_Comp_Load - This factor is the L_avg discussed in section 3.3.1, and
represents the average computation load of all the LPs. Since all behaviors are
equally weighted, this is equivalent to the number of vertices in the problem-graph
divided by the number of LPs.

•  Stddev_Comp_Load - Represents the standard deviation of the LP
computation loads.

•  Maxdev_Comp_Load - Represents the maximum positive deviation of the LP
computation loads.

•  Load_Delta_Factor - This factor is the H_b discussed in section 3.3.1. It
represents Maxdev_Comp_Load divided by Avg_Comp_Load.

•  LP Loads - Lists the computation load (i.e. number of behaviors) assigned to
each LP beginning with LP0 and proceeding in numeric order from left to right.

•  Communications Weight Matrix - This is the n x n communications weight
matrix shown in Figure 20 with an additional column to hold the row sums, and
an additional row to hold the column sums. The bottom right entry holds the sum
of all n x n entries and corresponds to the value Wght_Inter_LP_Arcs.

•  Total Partition Cost - This is the value of the objective cost function of
section 3.3.4.1.

•  Predicted Speedup - This is the value of the estimated speedup equation
of section 3.3.4.2.

•  Speedup Prediction Parameters - Alpha and beta represent the coefficients
used to balance the communications and load imbalance factors of the objective
cost function. Gamma is used as the coefficient to the total partition cost in the
speedup estimate function.

The bottom portion of the partition statistics file lists the behaviors assigned to each LP
and the number of arcs that are local to that LP, and is omitted from Figure 27.
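
The relationship between the communications weight matrix and the reported statistics can be checked with a short sketch. It is written in Python rather than GP-Tool's Ada, and the function names sample_stddev and partition_stats are hypothetical. The matrix and loads are the values read from Figure 27. Using the sample (n - 1) form of the standard deviation is an assumption; it was chosen because it reproduces the figures printed in the statistics file. The cost function, predicted speedup, and lookahead factor are not computed because their equations are given elsewhere.

```python
import math


def sample_stddev(values):
    """Standard deviation with an n - 1 divisor (assumed form)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def partition_stats(weights, loads, total_arcs):
    """Derive the statistics-file entries from an n x n weight matrix."""
    n = len(weights)
    row_sums = [sum(row) for row in weights]            # output weight per LP
    col_sums = [sum(weights[r][c] for r in range(n)) for c in range(n)]
    total = sum(row_sums)                               # Wght_Inter_LP_Arcs
    avg = total / n                                     # Avg_Wght_Arcs
    avg_load = sum(loads) / n
    maxdev_out = max(row_sums) - avg
    maxdev_load = max(loads) - avg_load
    return {
        "Wght_Inter_LP_Arcs": total,
        "Avg_Wght_Arcs": avg,
        "Stddev_Wght_Out_Arcs": sample_stddev(row_sums),
        "Maxdev_Wght_Out_Arcs": maxdev_out,
        "Stddev_Wght_In_Arcs": sample_stddev(col_sums),
        "Maxdev_Wght_In_Arcs": max(col_sums) - avg,
        "Comm_Cost_Factor": total / total_arcs,         # H_c
        "Comm_Dist_Factor": maxdev_out / avg,           # H_d
        "LP_Output_Lines": sum(1 for row in weights for w in row if w != 0),
        "Avg_Comp_Load": avg_load,
        "Maxdev_Comp_Load": maxdev_load,
        "Load_Delta_Factor": maxdev_load / avg_load,    # H_b
    }


# Values as read from Figure 27 (Wallace-Tree multiplier, SDF, 4 LPs).
FIG27_WEIGHTS = [
    [0.0, 2.0, 0.0, 5.0],
    [56.0, 0.0, 3.0, 2.0],
    [21.0, 48.0, 0.0, 8.0],
    [50.0, 40.0, 77.0, 0.0],
]
FIG27_LOADS = [263, 263, 262, 262]

if __name__ == "__main__":
    for key, value in partition_stats(FIG27_WEIGHTS, FIG27_LOADS, 1770).items():
        print(f"{key:22s}: {value:.4f}")
```

Because the 1770 arcs of this example all carry a hop-weight of 1.0, Wght_Inter_LP_Arcs equals the inter-component arc count of 312.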

4.2.3  Data  Structures.  The  implementation  of  the  partitioning  algorithms  is 
heavily  dependent  upon  the  data  structures  used  to  represent  the  problem-graph  for  the 
circuit  being  simulated.  Several  modifications  to  the  underlying  data  structures  used  in 
the  original  version  of  GP-Tool  have  been  implemented  in  order  to  improve  algorithm 
efficiency  and  to  facilitate  the  implementation  of  more  sophisticated  partitioning 
algorithms. 

In the original GP-Tool, a graph was represented by a set of vertex records and a set
of arc records. In addition, each vertex record contained its own set of arc records for arcs
that originated from that vertex. Each behavior in the circuit being simulated corresponds


A  local  arc  is  defined  as  one  that  is  between  two  behaviors  assigned  to  the  same  LP. 
-- Declaration of the Vertex Nodes & Instantiation of Generic Set pkg

type Vertex_Node is
   record
      The_Item        : Item;
      The_Arcs        : Arc_Set.Set;    -- set of output arcs
      Reference_Count : Natural := 0;   -- number of input arcs
      Next            : Vertex;         -- for garbage collection
   end record;

type Vertex is access Vertex_Node;

package Vertex_Set is new Set_Iterator (Item => Vertex);


-- Declaration of the Arc Nodes & Instantiation of Generic Set pkg

type Arc_Node is
   record
      The_Attribute   : Attribute;
      The_Source      : Vertex;         -- pointer to source vertex
      The_Destination : Vertex;         -- pointer to destination vertex
      Next            : Arc;
   end record;

type Arc is access Arc_Node;

package Arc_Set is new Set_Iterator (Item => Arc);


-- Declaration of the Graph Type

type Graph is
   record
      The_Vertices : Vertex_Set.Set;
      The_Arcs     : Arc_Set.Set;
   end record;


Figure  28.  Original  GP-Tool  Graph  Data  Structures 

to  a  unique  vertex  record,  and  each  dependency  between  behaviors  corresponds  to  a 
unique  arc  record. 

The  original  data  structure  declarations  are  shown  in  Figure  28.  With  these  data 
structures,  procedures  were  provided  to  take  a  graph  as  input  and  iterate  through  the  set  of 
vertices  or  the  set  of  arcs  which  comprised  the  graph.  A  procedure  was  also  provided  to 
take  a  single  vertex  as  input  and  iterate  through  its  set  of  output  arcs.  As  shown  in  Figure 
28,  each  arc  maintained  pointers  to  its  source  vertex  and  its  destination  vertex.  Thus,  it 
-- Declaration of the Vertex Nodes & Instantiation of Generic Set pkg

type Vertex_Node is
   record
      The_Item        : Item;
      The_Arcs        : Arc_Set.Set;    -- set of output arcs
      Incoming_Arcs   : Arc_Set.Set;    -- set of input arcs
      Reference_Count : Natural := 0;   -- number of input arcs
      Parent          : Vertex;         -- ptr to prev member of group
      Child           : Vertex;         -- ptr to next member of group
      Next            : Vertex;         -- for garbage collection
   end record;

type Vertex is access Vertex_Node;

package Vertex_Set is new Set_Iterator (Item => Vertex);


-- Declaration of the Arc Nodes & Instantiation of Generic Set pkg

type Arc_Node is
   record
      The_Attribute   : Attribute;
      The_Source      : Vertex;         -- pointer to source vertex
      The_Destination : Vertex;         -- pointer to destination vertex
      Next            : Arc;
   end record;

type Arc is access Arc_Node;

package Arc_Set is new Set_Iterator (Item => Arc);


-- Declaration of the Graph Type

type Graph is
   record
      The_Vertices : Vertex_Set.Set;
      The_Arcs     : Arc_Set.Set;
   end record;

Figure  29.  Modified  GP-Tool  Graph  Data  Structures 


was  possible  to  traverse  the  graph  in  the  forward  direction  (i.e.  following  arcs  from  tail  to 
head),  but  not  in  the  reverse  direction.  This  is  because  the  set  of  input  arcs  was  not 
maintained  by  each  vertex  record. 

To work around this limitation, the original version of GP-Tool built two separate graphs:
an “adjacency graph” which had the arcs in their forward directions, and a “dependency
graph” which was identical except that the direction of the arcs was reversed. This dual
type Process_Node_Type is
   record
      Process_Id     : Natural;
      Label_Name     : String80.String_Type;
      The_Delay      : Natural := 0;
      The_LP         : Natural := 0;
      Group_Size     : Natural := 1;
      Group_Num_Arcs : Natural := 0;
   end record;

Figure  30.  Process_Node_Type  Data  Structure 


graph approach was adequate for many applications (such as finding the strong
components),  but  it  had  several  disadvantages.  First,  it  doubled  the  amount  of  memory 
required  to  maintain  the  graph  information.  Second,  it  required  twice  as  long  to  build  the 
graph  from  the  input  file.  Third,  and  most  significantly,  there  was  no  way  to 
simultaneously  iterate  both  the  input  and  output  arcs  of  a  given  vertex  without  iterating 
through  the  entire  vertex  set  of  the  dual  graph  to  locate  the  matching  vertex  record.  The 
ability  to  perform  this  last  function  is  critical  to  being  able  to  efficiently  prioritize 
behaviors  for  potential  LP  reassignment  during  the  border  annealing  process. 

To  provide  the  needed  functionality,  the  vertex  record  was  modified  to  include  a  set 
of  incoming  arcs  in  addition  to  the  set  of  outgoing  arcs.  The  appropriate  procedure  to 
iterate  this  new  set  of  incoming  arcs  was  also  added.  Together,  these  changes  obviated  the 
need  to  maintain  two  separate  graphs  in  memory.  The  modified  data  structure 
declarations  are  shown  in  Figure  29. 

In addition to adding a set of incoming arcs to the vertex record, two additional fields
(Parent  and  Child)  were  added  to  allow  groups  of  vertices  to  be  linked  together  in  a 
doubly-linked  list  fashion.  These  fields  are  used  to  link  together  the  behaviors  that  have 
been  assigned  to  the  same  LP.  In  doing  so,  it  is  possible  to  quickly  iterate  through  all  of 
the  behaviors  that  have  been  assigned  to  a  given  LP  to  gather  pertinent  information,  such 
as  the  number  of  arcs  that  are  local  to  that  LP,  without  the  need  to  maintain  complex 
external data structures. To facilitate the recording of partition information in the graph
data structure, each vertex keeps track of which LP it is currently assigned to. In addition,
one vertex in each LP is arbitrarily chosen to be the “list head” for that LP, so called
because its parent field points to a null vertex, placing it at the head of the doubly-linked
list which links together the members of the LP. For easy access, the size of the LP and
the number of local arcs for that LP are recorded in the list head (referred to as the “head
vertex” for that LP). To track all of this information, the record data structure shown in
Figure 30, referred to as the Process_Node_Type, is used as the “Item” type in the
Vertex_Node record of Figure 29.

In  addition  to  providing  a  place  to  record  LP  information,  the  Process_Node_Type 
data  structure  is  where  the  behavior  specific  information  is  recorded.  Specifically,  the 
process  id  number,  the  behavior’s  label  name,  and  the  behavior  delay  are  recorded.  It 
should  be  noted  that  to  minimize  data  duplication,  only  the  head  vertex  for  each  LP 
maintains  the  information  regarding  the  LP’s  size  and  number  of  local  arcs.  These  fields 
are  simply  ignored  if  the  vertex  is  not  the  head  vertex. 
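
The effect of these changes can be illustrated with a small sketch, written in Python rather than GP-Tool's Ada. It is a simplification under stated assumptions: the class Vertex and the helpers add_arc, link_group, and walk_group are hypothetical names, arcs are stored as direct vertex references instead of arc records, and the Process_Node_Type fields are flattened into the vertex itself.

```python
class Vertex:
    """Vertex with both arc directions and LP group links (sketch only)."""

    def __init__(self, behav_id):
        self.behav_id = behav_id
        self.out_arcs = []        # destination vertices (forward direction)
        self.in_arcs = []         # source vertices (reverse direction)
        self.lp = None            # LP this behavior is assigned to
        self.parent = None        # previous member of the LP group
        self.child = None         # next member of the LP group
        self.group_size = 1       # meaningful on the head vertex only
        self.group_num_arcs = 0   # local arcs; head vertex only


def add_arc(src, dst):
    """Record one arc in both directions, so one graph is enough."""
    src.out_arcs.append(dst)
    dst.in_arcs.append(src)


def link_group(members, lp):
    """Chain members into a doubly-linked list and return the head vertex."""
    prev = None
    for v in members:
        v.lp = lp
        v.parent = prev
        v.child = None
        if prev is not None:
            prev.child = v
        prev = v
    head = members[0]             # its parent is None: the "list head"
    head.group_size = len(members)
    head.group_num_arcs = sum(
        1 for v in members for dst in v.out_arcs if dst.lp == lp
    )
    return head


def walk_group(head):
    """Yield every member of an LP by following the child links."""
    v = head
    while v is not None:
        yield v
        v = v.child


if __name__ == "__main__":
    vs = [Vertex(i) for i in range(4)]
    for s, d in [(0, 1), (1, 2), (2, 0), (2, 3)]:
        add_arc(vs[s], vs[d])
    head = link_group(vs[:3], lp=0)
    link_group(vs[3:], lp=1)
    print([v.behav_id for v in walk_group(head)], head.group_num_arcs)
```

With both arc lists held on the vertex, the input and output arcs of a behavior can be examined together without searching a second graph.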

4.2.4  Menu Structure.  The current version of GP-Tool uses a two-level menu
structure.  The  GP-Tool  main  menu  is  shown  in  Figure  31.  Items  1  through  4  allow  the 
user  to  produce  various  output  files  not  directly  related  to  the  partitioning  files,  and  are 
discussed  further  in  Appendix  C. 

Item  5  on  the  main  menu  takes  the  user  to  the  behavior  mapping  sub-menu  shown  in 
Figure  32,  from  which  the  user  can  select  the  desired  partitioning  algorithm  as  well  as 
modify various user-defined partitioning parameters. Items 4-6 allow the user to select an
AB Annealing partition using the depth-first, breadth-first, or random partitioning
algorithms, respectively, to provide the initial partition. See Appendix C for more
detailed information on the available menu options.
**********************  GP-TOOL MAIN MENU  ***********************

Select one of the following operations:

1 :  Generate Delay and Adjacency Information File
2 :  Generate SGE Data File
3 :  Generate Topological Sort File
4 :  Generate Strong Components File
5 :  Generate Behavior to Logical Process (LP) Mapping File(s)
0 :  Quit GP-Tool


Enter your menu choice now:


Figure  31.  GP-Tool  Main  Menu 


4.2.5  Strong Component Search.  In directed graph terminology, a strongly
connected component is defined as a maximal set of vertices with the property that there
is a path between any two vertices in the set (10:488). In terms of a problem graph for a
VHDL circuit simulation, a strongly connected component represents a complete
feedback loop, such as the one in Figure 33. Such a feedback loop creates a circular
dependency during simulation. It is common for such feedback loops in digital circuits to


************  GP-TOOL BEHAVIOR MAPPING MENU  *************

Select one of the following operations:

1 :  Generate Random Partitioning File
2 :  Generate Simple Depth-First Partitioning File
3 :  Generate Simple Breadth-First Partitioning File
4 :  Generate AB1-Annealing Partitioning File
5 :  Generate AB2-Annealing Partitioning File
6 :  Generate AB3-Annealing Partitioning File
7 :  Turn the .MAP and .ARCS output OFF
8 :  Modify the Cost Function Parameters
9 :  Return to Main Menu
0 :  Quit GP-Tool


Enter your menu choice now:


Figure 32.  GP-Tool Mapping Sub-Menu
IN1: 0->1
IN2: 1

Figure 33.  Example Feedback Loop - Simple Oscillator

involve a large number of signal state changes. By isolating the feedback loop on a single
LP, it may be possible to reduce the amount of inter-LP message traffic.

For example, consider the simple oscillator circuit in Figure 33. Signal IN2 is tied
high, while signal IN1 is initially low and is taken high to begin the output oscillation.
When IN1 is taken high, the output signal will change states every 5 ns (the sum of the
delays  of  the  AND  and  XOR  gates).  If  the  AND  and  XOR  gates  are  assigned  to  different 
LPs,  two  inter-LP  messages  (AND  to  XOR,  and  XOR  to  AND)  will  result  for  each 
change  of  the  output  state.  If  the  AND  gate  is  placed  on  the  same  LP  as  the  XOR  gate, 
however,  these  messages  will  no  longer  be  necessary  (all  communications  will  take  place 
via  the  local  behavior  and  active  signal  lists). 

The strong components of the problem graph are found using the following algorithm
(10:489): 

•  Perform  a  depth-first  search  on  the  input  graph  with  the  arcs  in  the  reverse 
direction,  keeping  track  of  the  order  in  which  the  vertices  are  finished.  A  vertex  is 
finished  when  all  paths  leaving  the  vertex  have  been  fully  explored. 

•  Perform  a  second  depth-first  search  on  the  input  graph  with  the  arcs  in  the  forward 
direction.  However,  begin  new  depth-first  trees  by  considering  the  vertices  in  the 
reverse  order  of  their  finishing  times  in  the  initial  search  of  the  previous  step. 
Keep  track  of  the  depth-first  trees  of  this  second  search. 

•  Output  the  depth-first  trees  from  the  second  search.  Each  one  of  these  trees 
corresponds  to  a  strongly  connected  component  of  the  input  graph. 
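
The three steps above can be sketched as follows. The sketch is written in Python rather than GP-Tool's Ada and is not the GP-Tool routine; the function name strong_components and the representation of the graph as a vertex count plus a list of (source, destination) pairs are assumptions made for illustration. This two-pass depth-first search is commonly known as Kosaraju's algorithm. The searches are iterative so that circuits with several thousand behaviors do not exhaust the recursion limit.

```python
def strong_components(num_vertices, arcs):
    """Return the strongly connected components as sorted vertex lists."""
    forward = [[] for _ in range(num_vertices)]
    reverse = [[] for _ in range(num_vertices)]
    for src, dst in arcs:
        forward[src].append(dst)
        reverse[dst].append(src)

    def dfs(adj, start, visited, finished):
        """Iterative DFS; appends vertices to `finished` as they finish."""
        visited[start] = True
        stack = [(start, iter(adj[start]))]
        while stack:
            vertex, successors = stack[-1]
            for nxt in successors:
                if not visited[nxt]:
                    visited[nxt] = True
                    stack.append((nxt, iter(adj[nxt])))
                    break
            else:
                # Every path leaving `vertex` has been explored.
                stack.pop()
                finished.append(vertex)

    # Step 1: search with the arcs reversed, recording finishing order.
    visited = [False] * num_vertices
    finish_order = []
    for v in range(num_vertices):
        if not visited[v]:
            dfs(reverse, v, visited, finish_order)

    # Step 2: search with the arcs forward, starting new trees in
    # reverse order of the finishing times from step 1.
    visited = [False] * num_vertices
    components = []
    for v in reversed(finish_order):
        if not visited[v]:
            tree = []
            dfs(forward, v, visited, tree)
            # Step 3: each depth-first tree is one strong component.
            components.append(sorted(tree))
    return components


if __name__ == "__main__":
    arcs = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
    print(strong_components(6, arcs))
```

In this example, vertices 0, 1, and 2 form one feedback loop, vertices 3 and 4 form another, and vertex 5 is a component by itself.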
4.2.6  Simple Depth-First (SDF) Partition.  The implementation of the Simple
Depth-First (SDF) partitioning algorithm is based upon the depth-first search routine used
in finding the strong components. The algorithm consists of the following steps:

•  Calculate  the  expected  size  of  the  current  LP  by  dividing  the  number  of 
unassigned  vertices  by  the  number  of  LPs  remaining  to  be  filled,  rounding  up  to 
the  nearest  integer.  Reset  the  vertex  counter  to  zero. 

•  While  the  vertex  counter  is  less  than  the  expected  size  of  the  current  LP,  traverse 
the  graph  in  a  depth-first  manner  with  the  arcs  in  the  forward  direction  using  a 
source  vertex  as  the  starting  point.  As  previously  undiscovered  vertices  are 
visited,  assign  them  to  the  current  LP,  mark  them  as  discovered,  and  increment  the 
vertex counter. If a newly discovered vertex is part of a strong component, assign
the entire strong component to the current LP and increment the vertex counter by
the size of the strong component. Note that this may put the vertex counter over
the limit set by the size of the LP calculated in the previous step. Finding the
strong components of the graph prior to performing the SDF partition is optional.
In  the  current  version  of  GP-Tool,  the  SDF  partition  by  itself  does  not  consider 
strong  components.  However,  when  the  SDF  partition  is  used  as  the  initial 
partition  for  the  AB- Annealing  algorithm,  a  strong  component  search  is  performed 
as  the  first  step  in  the  partitioning  process. 

•  If  the  current  depth-first  search  tree  is  completed  before  the  current  LP  has 
reached  its  target  size,  begin  a  new  search  by  choosing  another  undiscovered 
source  vertex  as  the  next  starting  point.  If  no  more  source  vertices  remain,  choose 
an  arbitrary  undiscovered  vertex  as  the  next  starting  point. 


A  source  vertex  is  one  with  no  inputs. 
•  If the current LP reaches its target size before the current depth-first search tree is
completed, the search is terminated and the process repeats starting again at step
one for the next LP. A new depth-first search tree is started for each successive LP
in an attempt to increase the probability of assigning a complete depth-first search
tree  to  the  LP.  This  is  desirable  because  each  depth-first  tree  represents  a  set  of 
dependent  tasks,  and  assigning  dependent  tasks  to  the  same  LP  will  reduce  the 
inter-LP  communications  overhead. 

This  algorithm  is  similar  to  the  Depth-First  Breadth-Next  (DFBN)  algorithm  discussed  in 
section  2.4.8,  except  that  load  balancing  is  considered  in  the  SDF  algorithm  whereas  it  is 
not  addressed  in  the  DFBN  algorithm.  Some  characteristic  traits  of  the  partitions 
generated  by  the  SDF  algorithm  are  as  follows: 

•  The  first  LP  will  contain  long  paths  of  dependent  behaviors  with  a  large  number 
of  local  arcs. 

•  Each  successive  LP  will  tend  to  have  shorter  paths  of  dependent  behaviors  than 
the  preceding  LP  as  it  gets  more  difficult  to  find  long  paths  of  dependent 
behaviors  which  are  not  yet  assigned  to  an  LP. 

•  The  final  LP  will  consist  of  the  fragments  of  the  problem  graph  that  were  not 
assigned to a previous LP, and will tend to have a relatively small number of local
arcs. 
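
The SDF steps described in this section can be sketched as follows, in Python rather than GP-Tool's Ada. This is an illustration under stated assumptions, not the GP-Tool implementation: the names sdf_partition and pick_start are hypothetical, the graph is given as a vertex count and a list of (source, destination) pairs, strong components are not considered (as in the stand-alone SDF partition), and ties between candidate starting vertices are broken by lowest vertex number.

```python
import math


def pick_start(assignment, indegree):
    """Prefer an unassigned source vertex, else any unassigned vertex."""
    fallback = None
    for v, lp in enumerate(assignment):
        if lp is None:
            if indegree[v] == 0:      # a source vertex has no inputs
                return v
            if fallback is None:
                fallback = v
    return fallback


def sdf_partition(num_vertices, arcs, num_lps):
    """Return a list mapping each vertex to its LP number."""
    forward = [[] for _ in range(num_vertices)]
    indegree = [0] * num_vertices
    for src, dst in arcs:
        forward[src].append(dst)
        indegree[dst] += 1

    assignment = [None] * num_vertices
    unassigned = num_vertices
    for lp in range(num_lps):
        # Expected size: unassigned vertices / LPs left, rounded up.
        target = math.ceil(unassigned / (num_lps - lp))
        count = 0
        while count < target:
            start = pick_start(assignment, indegree)
            if start is None:
                break
            # A fresh depth-first tree; abandoned once the LP is full.
            stack = [start]
            while stack and count < target:
                v = stack.pop()
                if assignment[v] is not None:
                    continue
                assignment[v] = lp
                count += 1
                stack.extend(
                    w for w in reversed(forward[v]) if assignment[w] is None
                )
        unassigned -= count
    return assignment


if __name__ == "__main__":
    arcs = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4),
            (5, 6), (6, 7), (7, 8), (8, 9)]
    print(sdf_partition(10, arcs, 3))
```

In this small example the last LP receives vertex 2 together with the tail of the second chain, which illustrates the fragment-collecting behavior noted above for the final LP.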


4.2.7  Simple Breadth-First (SBF) Partition.  The implementation of the Simple
Breadth-First (SBF) partitioning algorithm is identical to that of the SDF algorithm with
the following exceptions:

•  The  problem  graph  is  traversed  in  a  breadth-first  manner. 
•  When an LP is full, the graph traversal for the subsequent LP assignment picks up
where the previous one had left off. The breadth-first search tree is not terminated
prematurely.
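
The two exceptions can be sketched as follows, again in Python rather than GP-Tool's Ada and under the same assumptions as any illustration of this kind: the names sbf_partition and pick_start are hypothetical, the graph is a vertex count plus (source, destination) pairs, strong components are not considered, and the lowest-numbered candidate is used when a new starting vertex is needed. The point of interest is that the traversal queue is kept from one LP to the next.

```python
import math
from collections import deque


def pick_start(assignment, indegree):
    """Prefer an unassigned source vertex, else any unassigned vertex."""
    fallback = None
    for v, lp in enumerate(assignment):
        if lp is None:
            if indegree[v] == 0:
                return v
            if fallback is None:
                fallback = v
    return fallback


def sbf_partition(num_vertices, arcs, num_lps):
    """Return a list mapping each vertex to its LP number."""
    forward = [[] for _ in range(num_vertices)]
    indegree = [0] * num_vertices
    for src, dst in arcs:
        forward[src].append(dst)
        indegree[dst] += 1

    assignment = [None] * num_vertices
    queued = [False] * num_vertices
    queue = deque()                   # survives across LP boundaries
    unassigned = num_vertices
    for lp in range(num_lps):
        target = math.ceil(unassigned / (num_lps - lp))
        count = 0
        while count < target:
            if not queue:
                # Queue empty: every queued vertex is already assigned.
                start = pick_start(assignment, indegree)
                if start is None:
                    break
                queued[start] = True
                queue.append(start)
            v = queue.popleft()
            assignment[v] = lp
            count += 1
            for w in forward[v]:
                if not queued[w]:
                    queued[w] = True
                    queue.append(w)
        unassigned -= count
    return assignment


if __name__ == "__main__":
    arcs = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4),
            (5, 6), (6, 7), (7, 8), (8, 9)]
    print(sbf_partition(10, arcs, 3))
```

Vertex 4 is discovered while the first LP is being filled but is assigned to the second LP, because the traversal resumes from the saved queue.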

4.2.8  AB Border Annealing Algorithm.  The implementation of the AB Border
Annealing algorithm corresponds to the steps discussed in section 3.4.3. However, before
beginning the first iteration, the graph is evaluated to initialize several data structures
with statistical information concerning the state of the initial partition. The most
significant of these data structures is the initial value of the communications cost
sub-function:

where 

because the value of L_arcs is not known until the final state of the partition has been
reached. It is calculated as part of the routine that prints the corresponding ipx.arcs file.
The current algorithm for computing L_arcs is time consuming and including it in each step
of the annealing process would render the algorithm computationally infeasible. The
algorithm for computing L_arcs is discussed further in section 5.4.

In addition to ignoring the effect of the lookahead, an additional option has been
added to the annealing input parameters:

•  Ignore_Comm_Dist_Factor - Boolean value that allows the factor H_d to be
ignored when computing the value of the communications sub-function during the
annealing process.

When Ignore_Comm_Dist_Factor is true, the communications cost is calculated from the remaining factors alone (i.e., H_n H_c).
During data analysis, it was discovered that in some circumstances, the annealing
algorithm has a tendency to accept a small increase in O_arcs in order to gain a decrease in
H_d. However, the resulting partition produced a decrease in simulation performance relative
to the initial partition, indicating that the increase in O_arcs dominated the decrease in H_d.
The option Ignore_Comm_Dist_Factor was included to force the algorithm to accept an
increase in H_d in order to decrease the remaining portion of the cost function (i.e., H_n H_c).

Once  the  initial  communications  costs  have  been  calculated,  the  annealing  process 
begins as shown in Figure 34. The annealing queue is filled by the procedure
Prioritize_And_Queue using the criteria discussed in section 3.4.3.1. Once the queue
has been filled, vertices are removed in priority order and considered for LP reassignment
by the procedure Consider_Vertex.

The procedure Consider_Vertex initializes several array data structures with
information  concerning  the  impact  on  the  objective  cost  function  of  reassigning  the  given 
vertex  to  each  viable  destination  LP.  Only  those  LPs  which  contain  behaviors  that  are 
directly  connected  to  the  given  vertex  are  considered  viable.  The  specific  data  structures 
maintained  are: 

•  Input_Arcs_Array - Records the number of input arcs to the subject
behavior that originate from behaviors in each of the other LPs.

•  Output_Arcs_Array - Records the number of output arcs from the subject
behavior that go to behaviors in each of the other LPs.

•  Wght_Arcs_Array - Records the net change in the value of
Wght_Inter_LP_Arcs (sum of the inter-component arcs with each arc multiplied
by the corresponding hop weight). Used to calculate the change to H_c.

•  Maxdev_Comm_Array - Records the net change in the value of
Maxdev_Wght_Out_Arcs. Used to calculate the change to H_d.


85 


Figure  34.  AB  Border  Annealing  Algorithm  Cycle 


•  stdd*^jCoan_Array  -  Records  the  net  change  in  the  value  of 

stddev_wght_out_Arcs.  Used  as  a  tie  breaker,  if  necessary. 

•  Haacd«v__Loadjaxr«y  -  Records  the  net  change  in  the  value  of 

Maxdev_comp_Load.  Used  to  calculate  the  change  to  Hb. 


86 


•  Output_iiiJM)_Axray  -  Reccmis  the  new  value  of  LP_output_Lines  (numbo' 
of  arcs  in  tfie  LP  connectivity  graph).  Used  to  calculate  the  change  to  Hn- 
Each  of  these  data  structures  is  a  one-dimensional  array  indexed  by  the  destination  LP 
number.  If  the  destination  LP  is  not  a  viable  destination,  the  ccnresponding  values  in  the 
Wght_Arca_Array,  Maxdev_Comin_Array,  and  Stddev_Comm_Array  ate  set  to  an 
arbitrarily  large  value  (e.g.  2  times  the  number  of  arcs)  to  effectively  eliminate  these  LPs 
horn  move  consideration. 
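
As a rough illustration of the bookkeeping described above, the following Python sketch builds the two arc-count arrays and a sentinel-filled cost array for one vertex. It is not the GP-Tool code: the function name, the dictionary-based graph representation, and the choice of Python are assumptions made for this example. Only the array roles, the viability rule, and the "2 times the number of arcs" sentinel come from the text.

```python
def init_vertex_arrays(vertex, lp_of, preds, succs, num_lps, num_arcs):
    """Count arcs between `vertex` and each LP and flag viable destinations.

    lp_of : dict mapping behavior -> LP number (hypothetical representation)
    preds : dict mapping behavior -> list of behaviors feeding it
    succs : dict mapping behavior -> list of behaviors it feeds
    """
    input_arcs = [0] * num_lps   # plays the role of Input_Arcs_Array
    output_arcs = [0] * num_lps  # plays the role of Output_Arcs_Array
    for src in preds.get(vertex, []):
        input_arcs[lp_of[src]] += 1
    for dst in succs.get(vertex, []):
        output_arcs[lp_of[dst]] += 1

    home = lp_of[vertex]
    sentinel = 2 * num_arcs  # the "arbitrarily large" value from the text
    # A destination is viable only if it holds a directly connected behavior.
    viable = [lp != home and (input_arcs[lp] + output_arcs[lp]) > 0
              for lp in range(num_lps)]
    # Cost-change array: real deltas would be filled in later for viable LPs.
    wght_arcs = [0 if ok else sentinel for ok in viable]
    return input_arcs, output_arcs, viable, wght_arcs
```

Because non-viable entries carry the sentinel, a later minimum search over the cost arrays skips them without a separate viability test.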

The first two data structures, Input_Arcs_Array and Output_Arcs_Array, are
initialized once at the beginning of the procedure Consider_Vertex. In a worst-case
scenario, a given behavior has input arcs from all other LPs and output arcs to all other
LPs. In this situation, it would take O(N) operations to initialize these data structures
(where N is the number of behaviors in the graph). However, on average, each behavior
will have only E/N input arcs and E/N output arcs (where E is the number of arcs in the
graph). For a given circuit, E and N are fixed and have a relatively small ratio. Thus, on
average, it takes only O(2E/N) ≈ O(1) operations to initialize these data structures.
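
For the two test circuits described in section 4.3, the average number of arcs per behavior can be checked directly. The short Python sketch below (the function name is an assumption for this example) computes E/N from the graph sizes reported there.

```python
def avg_arcs_per_behavior(num_arcs, num_behaviors):
    """Average number of input (or output) arcs per behavior, E/N."""
    return num_arcs / num_behaviors


# Graph sizes reported for the test circuits in section 4.3.
wallace = avg_arcs_per_behavior(1770, 1050)       # about 1.69
assoc_memory = avg_arcs_per_behavior(9312, 4243)  # about 2.19
print(f"Wallace tree:       E/N = {wallace:.2f}")
print(f"Associative memory: E/N = {assoc_memory:.2f}")
```

Both ratios are close to 2, which is consistent with treating the initialization cost as a small constant.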

For viable destination LPs, however, the net change to the communications cost
factors must be calculated. Although the communications costs between LPs are recorded
in a P^2 data structure (the Comm_Weight_Matrix), where P is the number of LPs, only
those rows and columns associated with the source and destination LPs are affected by
the move. Thus, the net change to the communications cost factors for a particular
destination LP is calculated in O(P) time. Since there are (P-1) potential viable
destination LPs, the upper limit on the running time of the Consider_Vertex
procedure is O(P(P-1)) = O(P^2) (assuming O(1) average time to calculate the input/output
dependencies of a viable destination LP, as discussed in the previous paragraph).
However, as P is increased, it is reasonable to expect that only a small fraction of the LPs
will be viable destinations for the average vertex, making the average running time for
Consider_Vertex approximately O(P).

The data structures initialized by the procedure Consider_Vertex are passed to the
procedure Select_Best_Move, which evaluates the viable moves to find the one that
will result in the smallest value of the communications cost sub-function:

    Hn Hc (1 + Hd)

or

    Hn Hc

if Ignore_Comm_Dist_Factor is set to true. If the destination LP with the smallest
communications cost sub-function value would push Hb over the maximum value
(Load_Imbal_Tol), the selected move is discarded and the next best move is sought.
In no case will an increase in the communications cost sub-function be allowed. Thus,
it is possible that no move will be selected. The procedure’s running time is O(P)
since each LP must be considered.
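
The selection rule can be sketched in Python as a single pass over the candidate destinations. This is only an illustration under stated assumptions: the tuple layout of `candidates`, the function names, and the choice to permit moves that leave the cost unchanged are all hypothetical. The two cost forms and the load tolerance check follow the description above.

```python
def comm_cost(hn, hc, hd, ignore_dist_factor):
    """Communications cost sub-function: Hn*Hc*(1+Hd), or Hn*Hc if Hd is ignored."""
    return hn * hc if ignore_dist_factor else hn * hc * (1.0 + hd)


def select_best_move(current_cost, candidates, load_imbal_tol, ignore_dist_factor):
    """Return the destination LP of the best admissible move, or None.

    candidates: list of (dest_lp, hn, hc, hd, hb) tuples, one per viable LP,
    holding the cost factors that would result from the move (assumed layout).
    """
    best_lp, best_cost = None, None
    for dest_lp, hn, hc, hd, hb in candidates:
        if hb > load_imbal_tol:
            continue  # would exceed the load imbalance tolerance
        cost = comm_cost(hn, hc, hd, ignore_dist_factor)
        if cost > current_cost:
            continue  # an increase in the sub-function is never accepted
        if best_cost is None or cost < best_cost:
            best_lp, best_cost = dest_lp, cost
    return best_lp
```

Taking the minimum over admissible candidates in one pass gives the same result as discarding the best move and seeking the next best, while keeping the work linear in the number of LPs.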

If a move is selected, it is carried out by the procedure Move_Vertex. The move
involves an update to the partition statistics values to record the new cost factors, as well
as a series of simple list inserts and deletes to assign the vertex to the new LP. The
procedure Move_Vertex has a running time of O(P) since the communications weight
matrix must be updated to contain the new values for the rows and columns associated
with the source and destination LPs.
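
To show why only the rows and columns of the source and destination LPs change, the following Python sketch updates an LP-to-LP arc-count matrix when one vertex is moved. The data layout and names are assumptions for this example, hop weights are taken as 1.0, and the vertex is assumed to have no arc to itself. It does not reproduce the statistics refresh that gives the procedure its O(P) cost.

```python
def move_vertex(vertex, dst, lp_of, preds, succs, comm_weight):
    """Reassign `vertex` to LP `dst`, updating the P x P arc-count matrix in place.

    comm_weight[i][j] holds the number of arcs from LP i to LP j (i != j).
    """
    src = lp_of[vertex]
    for succ in succs.get(vertex, []):   # arcs leaving the vertex
        other = lp_of[succ]
        if other != src:
            comm_weight[src][other] -= 1  # arc no longer leaves the source LP
        if other != dst:
            comm_weight[dst][other] += 1  # arc now leaves the destination LP
    for pred in preds.get(vertex, []):   # arcs entering the vertex
        other = lp_of[pred]
        if other != src:
            comm_weight[other][src] -= 1
        if other != dst:
            comm_weight[other][dst] += 1
    lp_of[vertex] = dst
```

Every index touched is either `src` or `dst` in one position, so the rest of the matrix is left untouched.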

If the annealing queue is not empty, the next vertex is removed and the consideration
process is repeated. If the annealing queue is empty, the current iteration is complete. If
the maximum number of iterations has not been reached, the value of the communications
cost sub-function is compared to the value computed at the end of the previous iteration
(or at the beginning of the algorithm for the first iteration). If there was no net
improvement, the iteration is considered “worthless.” If there have been
Max_Worthless_Iter consecutive worthless iterations, the annealing process is
terminated.
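
The termination logic can be summarized in a few lines of Python. This is a sketch of the control flow only: the `run_iteration` callback, which stands in for one full pass over the annealing queue, and the return value are assumptions for this example.

```python
def anneal(initial_cost, run_iteration, num_iterations, max_worthless_iter):
    """Run annealing iterations until the iteration or worthless limit is hit.

    run_iteration: callable taking the current cost and returning the cost
    after one full pass over the annealing queue (hypothetical interface).
    Returns (final_cost, iterations_run).
    """
    cost = initial_cost
    worthless = 0
    iterations_run = 0
    for _ in range(num_iterations):
        new_cost = run_iteration(cost)
        iterations_run += 1
        improved = new_cost < cost
        cost = new_cost
        # A net improvement resets the count of consecutive worthless iterations.
        worthless = 0 if improved else worthless + 1
        if worthless >= max_worthless_iter:
            break
    return cost, iterations_run
```

With Num_Iterations set to 500 and Max_Worthless_Iter set to 25, for instance, a run stops at the first stretch of 25 iterations without improvement, or after 500 iterations, whichever comes first.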

With the streamlined implementations of the procedures Consider_Vertex and
Move_Vertex, the most time-consuming portion of the annealing cycle appears to be the
procedure Prioritize_And_Queue. This appears to be due to the linear data structures
used to provide priority queue management. A splay tree queue implementation may
provide a higher level of efficiency.
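
As a point of comparison, the sketch below shows a priority queue built on a binary heap using Python's standard `heapq` module. This is a different structure from the splay tree suggested above, used here only because it is readily available; it offers logarithmic insert and remove in place of a linear scan. The class name and the convention that larger priority values are removed first are assumptions for this example.

```python
import heapq
import itertools


class AnnealingQueue:
    """Priority queue of vertices; highest priority is removed first (assumed)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, vertex, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), vertex))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A typical use is to push every border vertex once per iteration and then pop until the queue is empty.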

4.3  Test  Cases 

4.3.1  Wallace-Tree Multiplier.  Prior to this thesis effort, the Wallace tree
multiplier was the largest VHDL circuit simulated in parallel at AFIT with the VSIM
simulator. The multiplier takes two eight-bit vectors as input and produces a single
twelve-bit product vector as output (4:131). The resulting problem graph consists of 1,050
behaviors and 1,770 inter-behavior arcs. The simulation runs from 0 to 2000 ns.

4.3.2  Associative Memory Array.  The associative memory array circuit consists
of a 16 x 16 memory array, associated control registers, and a 68-vector testbench. The
associative memory is currently the largest circuit simulated with VSIM. The resulting
problem graph consists of 4,243 behaviors and 9,312 inter-behavior arcs. The testbench
consists of the following actions:

1.  Write  to  all  memory  words  in  order  from  word  0  to  word  15  (16  writes). 

2.  Read  all  memory  words  in  order  from  word  0  to  word  15  (16  reads). 

3.  Search  for  certain  pre-specified  patterns  (16  searches). 

4.  Read  from  all  memory  words  in  an  arbitrary  order  (16  reads). 

5.  Perform  read  operations  with  multiple  words  selected  (4  reads).


Figure  35.  Associative  Memory  Array 


The  circuit  is  built  in  a  hierarchical  manner  to  allow  for  easy  transformation  and 
compilation  with  VSIM.  When  written  to  run  with  the  Synopsys  commercial  VHDL
simulator,  the  simulation  ran  from  0  to  8000  ns.  However,  VSIM  has  a  limit  of 
approximately  2000  ns  because  of  the  data  type  used  to  represent  the  simulation  clock.  To 
get  around  this  problem,  all  time  units  in  the  associative  memory  VHDL  source  code 
were  changed  to  picoseconds.  Thus,  the  simulation  runs  from  0  to  8000  ps  (8  ns). 

A block diagram for the associative memory circuit is shown in Figure 35. Note that
all three input registers and all three output registers are clocked by the same enable
signal as a matter of design convenience. As a result, the value of the tag output register
oscillates rapidly during memory writes. This adds a large number of events to the
simulation and slows it down. Observation of the simulation shows that it progresses
slowly until the writes are completed (at time 2000 ps), after which it progresses at a
much faster pace. The correctness of the output remains unaffected since the content of
the tag register is not relevant during a memory write.

V.  Methodology and Results


5.1  Overview 

This  chapter  presents  a  discussion  of  the  performance  of  the  VSIM  test  cases 
discussed  in  the  previous  chapter.  Four  different  partitionings  created  with  the  GP-Tool 
utility  are  used.  Specifically,  each  circuit  was  simulated  with  a  random  partition,  a  simple 
depth-first  (SDF)  partition,  and  a  simple  breadth-first  (SBF)  partition.  The  best  of  these 
three  partitions  was  then  used  as  the  initial  partition  for  the  AB  border-annealing 
algorithm  to  create  a  fourth  partition. 

The  primary  performance  measurements  were  taken  on  the  8-node  iPSC/2.  All 
speedup  calculations  use  a  single-LP  partition  as  the  performance  baseline.  Each  circuit 
was  simulated  with  2, 3, 4, 5,  6, 7,  and  8  LPs  for  each  of  the  four  partition  types.  For  the 
Wallace  tree,  the  performance  measurements  and  message  counts  for  each  configuration 
are  calculated  from  the  average  of  20  simulation  trials.  For  the  associative  memory  array, 
only  10  simulation  trials  were  run  for  each  configuration  due  to  the  large  amount  of  time 
required  for  each  trial. 

The  results  of  each  circuit  are  discussed  in  terms  of  the  resulting  speedup,  the  inter-LP 
message  traffic,  and  the  corresponding  partition  statistics.  The  inter-LP  message  traffic  is
analyzed  in  terms  of  message  traffic  originating  from  each  LP.  Tables  containing  the 
simulation  trial  results  and  partition  statistics  for  each  combination  involving  1, 2, 4,  or  8 
LPs  are  included  in  Appendix  D  along  with  supplemental  inter-LP  message  traffic  charts. 

In addition to the iPSC/2 results, the Wallace tree multiplier was run on an iPSC/860
hypercube using up to 32 nodes. These results are presented briefly in section 5.6.

5.2  Speedup Results

Figure 36. Wallace Tree Speedup Results Comparison

5.2.1  Wallace-Tree  Multiplier.  The  Wallace  tree  speedup  results  are  shown  in 
Figure  36.  All  speedup  calculations  are  in  reference  to  the  single  LP  case  which  required 
an  average  of  67,947  ms  to  complete.  Figure  37  compares  the  number  of  LP  output  lines, 
the  number  of  inter-LP  arcs,  and  the  communications  distribution  factor  for  each  of  the 
partitioning  algorithms,  while  Figure  38  shows  the  corresponding  message  counts.  The 
message  counts  are  shown  in  terms  of  real,  null,  and  total  messages  sent  from  all  LPs.
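
Speedup here is the single-LP run time divided by the multi-LP run time. As a small worked example, the Python sketch below converts reported speedups back into the average run times they imply, using the 67,947 ms baseline and the random-partition speedups of 2.72 and 1.23 reported in this section. The function names are assumptions for this illustration.

```python
BASELINE_MS = 67_947  # average single-LP run time for the Wallace tree


def speedup(baseline_ms, parallel_ms):
    """Speedup relative to the single-LP baseline."""
    return baseline_ms / parallel_ms


def implied_runtime_ms(baseline_ms, reported_speedup):
    """Average multi-LP run time implied by a reported speedup."""
    return baseline_ms / reported_speedup


for s in (2.72, 1.23):
    print(f"speedup {s:.2f} -> about {implied_runtime_ms(BASELINE_MS, s):,.0f} ms")
```

The implied times are roughly 25 seconds at the peak and 55 seconds with 8 LPs, against a baseline of about 68 seconds.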

As shown in Figure 36, the random partitioning speedup peaks at 2.72 with 3 LPs and
declines to 1.23 with 8 LPs. Figures 37 and 38 show that this sharp drop-off in speedup
is due to a dramatic increase in the inter-process communications overhead as the number
of LPs is increased. Specifically, Figure 37 shows that for 8 LPs, the random partition
has the maximum of 56 LP output lines and a total of 1,558 of 1,770 arcs (88%) that
cross the LP boundaries. Comparing these curves to Figure 38 shows the direct
relationships between the number of LP output lines and the number of null messages
transmitted, and between the number of inter-LP arcs and the number of real messages
transmitted.

Note that the random partitioning real message curve (Figure 38) decreases in slope as
the number of LPs is increased. This occurs as the number of random partition inter-LP
arcs quickly approaches its theoretical maximum (e.g., 49% with 2 LPs and 75% with 4
LPs). This correlates directly to the number of real messages approaching its maximum
limit at approximately the same rate. The maximum number of real messages is
determined by the number of actual signal changes in the simulation. If 100% of the arcs
cross LP boundaries, then every signal change in the simulation will result in the
transmission of a real message.
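
The percentages quoted for the random partition are close to what a simple probability argument predicts. If each behavior is assigned to one of P LPs independently and uniformly at random (an assumption about the random partitioner, which may instead balance LP sizes exactly), the two endpoints of an arc land in different LPs with probability 1 - 1/P. The Python sketch below compares that estimate with the observed 8-LP value; the function name is hypothetical.

```python
def expected_cut_fraction(num_lps):
    """Expected fraction of arcs crossing LP boundaries under uniform random assignment."""
    return 1.0 - 1.0 / num_lps


for p in (2, 4, 8):
    print(f"P = {p}: expected inter-LP arc fraction = {expected_cut_fraction(p):.1%}")

# Observed 8-LP value for the Wallace tree: 1,558 of 1,770 arcs.
print(f"observed at P = 8: {1558 / 1770:.1%}")
```

The estimates of 50%, 75%, and 87.5% line up with the reported 49%, 75%, and 88%.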

On the other hand, the slope of the null message curve for random partitioning tends
to increase as the number of LPs is increased. This follows from the direct relationship
between the number of null messages and the number of LP output lines, along with the
fact that the random partitioning algorithm tends to produce the maximum of P(P-1) LP
output lines. The theoretical limit on the number of LP output lines is the same as the
number of inter-LP arcs. However, the limiting factor of P(P-1) means that a larger
number of LPs will be required for the actual number of LP output lines to approach its
maximum. Furthermore, while the number of inter-LP arcs and LP output lines share a
common theoretical maximum, the maximum number of null messages may be much
higher than the maximum number of real messages. This is due to several factors. First,
each real message transmitted may result in the transmission of multiple null messages.
Second, null messages are transmitted over all output lines each time an LP must block
for input. Finally, the number of null messages is partially dependent upon the amount of
lookahead in the corresponding lpx.arcs file.
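
The P(P-1) limit is easy to tabulate. The Python sketch below computes it for 8 LPs and expresses the 8-LP output-line counts reported in this chapter for the Wallace tree partitions as a fraction of that limit. The function name is an assumption for this example.

```python
def max_lp_output_lines(num_lps):
    """Maximum number of directed LP-to-LP output lines, P(P-1)."""
    return num_lps * (num_lps - 1)


# 8-LP output-line counts reported for the Wallace tree partitions.
reported = {"random": 56, "SDF": 45, "SBF": 29, "AB annealing": 26}
limit = max_lp_output_lines(8)
for name, lines in reported.items():
    print(f"{name}: {lines} of {limit} possible lines ({lines / limit:.0%})")
```

The random partition sits at the limit, while the deliberate partitions use progressively fewer of the possible lines.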

An important result of this is that as the number of LPs is increased, the null message
communications overhead required to avoid deadlock begins to dominate the overhead
from the real message traffic. For example, the approximate null to real message ratios
for random partitioning with 2, 4, and 8 LPs are 1:4, 1:1, and 4:1, respectively.

The SDF partitioning speedup curve is slightly better, peaking at 3.04 with 5 LPs and
decreasing to 1.50 with 8 LPs. Looking at Figure 37, the most notable improvement
between the random and SDF partitions is in the number of inter-LP arcs. With the
deliberate depth-first partitioning algorithm, the number of inter-LP arcs approaches its
maximum at a much slower rate, reaching only 30% with 8 LPs (vs. 88% for the random
partition). Figure 38 shows how this improvement translates directly into a similar
improvement in the number of real messages transmitted. For example, with 8 LPs, there
is a 73% reduction in the number of real messages. In addition, the SDF partitioning
algorithm reduced the number of LP output lines to only 45 with 8 LPs (vs. 56 for the
random partition). Again, Figure 38 shows how this improvement translates directly into
a similar improvement in the number of null messages transmitted (e.g., a 27% reduction
in the number of null messages with 8 LPs). It should be noted that an increase in the
amount of lookahead in the lpx.arcs files and the decrease in the number of real
messages also contributed to the decrease in the number of null messages. The increased
lookahead is due to the ability of the SDF algorithm to assign relatively long chains of
dependent behaviors to the same LP. In many cases, this will increase the minimum path
time through the LP.

Even with this reduction in the number of null messages, however, the shape of the
null message curve is still proportional to P(P-1). As a result, the null message overhead
still dominates the real message overhead as the number of LPs increases. In fact, due to
the reduction in the number of real messages, fewer LPs are required before null messages
begin to dominate real messages. For example, the null to real message ratio for the SDF
partition begins at 2:1 with 2 LPs and increases to 11:1 for 8 LPs. This domination of the
null messages offsets the apparent benefits gained by the decrease in the real message
traffic. For example, with 8 LPs, there was a 73% reduction in real messages, but only a
36% reduction in total messages. The continued high communications cost is the primary
reason that the speedup curve for the SDF partition drops off rapidly with more than 5
LPs.
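
The 8-LP figures above are mutually consistent, which can be checked with a little arithmetic. The Python sketch below starts from the approximate 4:1 null to real ratio reported for the random partition, applies the 73% and 27% reductions, and recovers both the new ratio and the reduction in total messages. The function name and the normalization to one real message are assumptions for this example, and the inputs are the rounded values from the text.

```python
def scaled_traffic(null_per_real, real_reduction, null_reduction):
    """Apply fractional reductions to a baseline with `null_per_real` nulls per real message.

    Returns (new null:real ratio, fractional reduction in total messages).
    """
    base_real, base_null = 1.0, float(null_per_real)
    new_real = base_real * (1.0 - real_reduction)
    new_null = base_null * (1.0 - null_reduction)
    ratio = new_null / new_real
    total_reduction = 1.0 - (new_real + new_null) / (base_real + base_null)
    return ratio, total_reduction


# Random 8-LP baseline of about 4:1; SDF cuts real by 73% and null by 27%.
ratio, total = scaled_traffic(4, 0.73, 0.27)
print(f"null:real ratio about {ratio:.1f}:1, total reduction about {total:.0%}")
```

The result is a ratio near 11:1 and a total reduction near 36%, matching the reported values.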

An additional factor that appears to limit the speedup gains of the SDF partition is
the relatively high value of Hd, as shown at the bottom of Figure 37. The inter-LP arcs of
the random partition are relatively evenly distributed among the LPs. However, the SDF
algorithm tends to produce partitions in which a relatively large portion of the inter-LP
arcs originate from a single LP. The relationship between Hd, the inter-LP message
traffic, and the resulting speedup is discussed further in section 5.4.

The SBF partition speedup curve is even better than the SDF curve, peaking at 3.43
and declining at a much slower rate to 2.86 with 8 LPs. While Figure 37 shows that both
algorithms result in partitions with approximately the same number of inter-LP arcs, the
SBF algorithm is able to take advantage of the feed-forward nature of the Wallace tree
circuit and produce a partition with a significantly smaller number of LP output lines
(e.g., 29 with 8 LPs vs. 45 for the SDF partition). As shown in Figure 38, this
corresponds to a slower growth rate in the null message curve. Still, with the
corresponding reduction in real message traffic, the communications overhead is
dominated by the null message traffic, with a null to real message ratio of 2:1 with
2 LPs growing to 6:1 with 8 LPs.

Since the SBF partition provided the best speedup results, it was used as the initial
partition for the AB annealing algorithm to obtain the fourth speedup curve shown in
Figure 36. The following input parameters were supplied to the annealing algorithm:

    Num_Iterations     = 500     Ignore_Comm_Dist_Factor = true
    Max_Worthless_Iter = 25      Topological             = false
    Load_Imbal_Tol     = 1.5%    Hop_Weights             = all 1.0

With Ignore_Comm_Dist_Factor set to false, the algorithm reduced Hd and Hc at the
expense of a slight increase in Hn. The result was a reduced and more evenly distributed
real message communications load that was overwhelmed by an increased null message
communications load, resulting in no net improvement in the speedup curve. With
Ignore_Comm_Dist_Factor set to true, the algorithm tends to reduce Hn and Hc
at the expense of a potential increase in Hd. This effect can be seen by comparing the SBF
and AB Annealing curves in Figure 37.

Figures 37 and 38 show that the AB Annealing partition caused a reduction in the
number of inter-LP arcs with a corresponding reduction in the amount of real message
traffic for all LP values. For example, with 8 LPs, the number of real messages
transmitted has been reduced by 83% over the random partition. However, the
corresponding speedup results were mixed, with no noticeable improvement over the SBF
partition for 2, 3, 4, 5, and 7 LPs. The speedup curve shows a modest improvement with 6
LPs, and a more significant improvement with 8 LPs, where it peaks at 3.89. The large
improvement with 8 LPs is due to a decrease in null message traffic caused by a reduction
in LP output lines from 29 to 26, along with an increase in the average lookahead.

There are a number of interesting observations that can be made about the AB
annealing speedup curve. First, the only factor that appears to be limiting the performance
improvement for a majority of the data points is an increase in the value of Hd. This lends
credibility to the assertion that the distribution of the inter-process communications is a
significant contributing factor to simulation performance. On the other hand, the largest
increase in Hd occurs with 8 LPs, which corresponds to the data point with the greatest
improvement in simulation speedup. In this instance, the increase in Hd appears to be
dominated by the reduction in Hn and Hc. This phenomenon is likely due to a failure of
the cost factor Hd to accurately capture the relationship between the distribution of the
inter-process communications and the resulting simulation performance.

5.2.2  Associative Memory Array.  The associative memory speedup results are
shown in Figure 39. All speedup calculations are in reference to the single LP case, which
required an average of 4,380,074 ms to complete. Figures 40 and 41 compare the
partition statistics and resulting message traffic, respectively.

As shown in Figure 39, the random partitioning speedup peaks at 3.95 with 5 LPs
and declines to 3.03 with 8 LPs. Due to the large number of total simulation events in the
associative memory circuit, the communications overhead for random partitioning is
dominated by real message traffic for the number of LPs used in this thesis. For example,
the null to real message ratio begins at 1:25 with 2 LPs and increases to 1:2 with 8 LPs.
Further increases in the number of LPs will swing the ratio the other way, and the
communications overhead will be dominated by the null message overhead. This can be
seen by observing the slopes of the real and null message curves in Figure 41.

The SDF partitioning speedup curve shows consistently better results than the random
partition, peaking at 4.49 with 5 LPs and decreasing to 3.53 with 8 LPs. Looking at
Figure 40, it can be seen that SDF partitioning provides a noticeable improvement over
random partitioning in terms of both the number of inter-LP arcs and the number of LP
output lines. For example, with 8 LPs, the number of inter-LP arcs has been reduced by
61% (from 8129 to 3145), while the number of LP output lines has been reduced by 21%
(from 56 to 44). However, over 50% of the remaining inter-LP arcs originate from a
single LP, resulting in a value of Hd of 333.2%.

Figure 39. Associative Memory Speedup Results Comparison

Looking at Figure 41, the effect on the real message curve was even more dramatic
than the effect on the speedup curve, with the number of real messages and null messages
being reduced by 90% and 41%, respectively, in the 8 LP case. However, due to the
decrease in real message traffic caused by the improved partition, the null messages begin
to dominate the communications overhead at a much earlier point. The null to real message
ratio with 8 LPs is 3:1 (vs. 1:2 for the random partition). Nevertheless, the total
communications overhead was reduced by over 72% with 8 LPs. Although this
improvement was significant, the corresponding speedup was improved by less than 17%.

Figure 40. Associative Memory Partition Statistics Comparison

The SBF partitioning algorithm performed very poorly on the associative memory
circuit, giving speedup that was worse than random partitioning. The exact reason for the
poor performance of the SBF partitions is not clear, as there were improvements in both
the number of inter-LP arcs and LP output lines. However, the value of Hd was
consistently higher, although it was less than that of the SDF partitions. It appears that the
decrease in message overhead was not large enough to overcome the increase in Hd.
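
The 8-LP comparisons between the random and SDF partitions quoted earlier in this subsection can be cross-checked with a simple percentage-change helper. The Python sketch below uses only values reported in the text (inter-LP arcs, LP output lines, and speedup); the function name is an assumption for this example.

```python
def pct_change(old, new):
    """Signed percentage change from `old` to `new`."""
    return (new - old) / old * 100.0


# 8-LP associative memory values, random partition -> SDF partition.
print(f"inter-LP arcs:   {pct_change(8129, 3145):+.1f}%")
print(f"LP output lines: {pct_change(56, 44):+.1f}%")
print(f"speedup:         {pct_change(3.03, 3.53):+.1f}%")
```

The arcs and output lines fall by about 61% and 21%, while the speedup rises by only about 16.5%, which illustrates how weakly the large traffic reductions carry over to run time for this circuit.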

It  is  interesting  to  note  in  Figure  40  that  the  SBF  and  SDF  partitions  result  in 
approximately  the  same  number  of  LP  output  lines.  However,  Figure  41  clearly  shows 
that  the  SBF  partitions  resulted  in  a  noticeably  higher  amount  of  null  message  traffic. 
There  are  two  reasons  for  this  effect.  First,  the  SDF  partitions  resulted  in  larger  average 
lookahead  values.  This  was  expected  since  the  SDF  algorithm  has  a  better  chance  of 
grouping  long  sequences  of  dependent  behaviors  due  to  the  order  in  which  the  graph  is
traversed  during  partitioning.  Second,  when  an  LP  sends  a  real  message  over  one  output 
line,  it  sends  null  messages  over  all  other  output  lines  with  the  same  timestamp. 
Therefore,  the  large  number  of  real  messages  in  the  SBF  partitions  causes  an  increase  in 
the  null  message  overhead. 

For  the  associative  memory  circuit,  the  SDF  partition  was  used  as  the  initial  partition
to  the  AB  annealing  algorithm  to  get  the  fourth  speedup  curve  shown  in  Figure  39.  The 
following  input  parameters  were  used  to  the  annealing  algorithm: 

    Num_Iterations     = 500     Ignore_Comm_Dist_Factor = false
    Max_Worthless_Iter = 50      Topological             = false
    Load_Imbal_Tol     = 0.5%    Hop_Weights             = all 1.0

Looking at Figure 40, it is clear that the border annealing algorithm improved the
quality of the partition in terms of each of the three communications cost factors: LP
output lines, inter-LP arcs, and communications distribution factor. However, the
corresponding speedup results were decidedly mixed, with the new partitions performing
the same as the SDF partitions with 5 and 7 LPs. On the other hand, the highest speedup
obtained for the associative memory, 4.89, was achieved with the 6 LP AB annealing
partition. Although the exact reason for the lack of improvement in the 5 and 7 LP cases
is not clear, there appear to be several contributing factors.

As an example, comparison of Figures 40 and 41 shows that although the number of
inter-LP arcs is reduced with the AB annealing algorithm, the number of real messages
transmitted is slightly higher. This is possible because, as discussed in section 3.3.2.1, the
actual real message communications load over each arc is dependent upon the signal
activity of the circuit. Since this information is not available prior to simulation, each arc
is assumed to have an equal cost in terms of message load.
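
A toy example makes the effect concrete. In the Python sketch below, the per-arc signal-change counts are hypothetical numbers invented purely for illustration, not measured data, and the function and variable names are assumptions. It shows how a partition that cuts fewer arcs can still send more real messages when the arcs it cuts are busier.

```python
def real_messages(cut_arcs, activity):
    """Real messages sent: one per signal change on each arc that crosses LPs."""
    return sum(activity[arc] for arc in cut_arcs)


# Hypothetical signal-change counts per arc (not measured data).
activity = {"a1": 5, "a2": 5, "a3": 5, "a4": 40}

partition_a = ["a1", "a2", "a3"]  # three low-activity arcs cross LP boundaries
partition_b = ["a4"]              # a single high-activity arc crosses instead

print(real_messages(partition_a, activity))  # 15 messages over 3 arcs
print(real_messages(partition_b, activity))  # 40 messages over 1 arc
```

Under the equal-cost assumption the second partition looks three times better, yet it would carry more than twice the real message traffic.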

Although  real  message  traffic  was  increased  slightly,  the  null  message  traffic  was 
lowered.  As  expected,  this  was  due  to  a  consistent  reduction  in  the  number  of  LP  output 
lines  caused  by  the  AB  border  annealing  algorithm.  However,  the  reduction  in  null 
messages  is  not  as  large  as  might  be  expected  from  the  reduction  in  LP  output  lines.  As 
with  the  SBF  partitions,  this  is  due  to  the  slight  increase  in  real  message  traffic.  The  net 
result  is  a  modest  decrease  in  total  message  traffic  over  the  SDF  partitions,  with  a 
maximum  decrease  of  8.67%  on  7  LPs. 

As  stated  previously,  the  AB  annealing  speedup  results  were  mixed.  The  biggest 
increase  over  the  SDF  partitions,  15%,  occurred  with  6  LPs.  It  is  interesting  to  note  that 
this  corresponds  with  the  biggest  drop  in  the  communications  distribution  factor  Hd  (see
Figure  40).  While  this  is  consistent  with  the  expected  results  from  the  objective  cost 
function,  it  is  also  interesting  to  note  that  the  smallest  speedup  gains  correspond  to  the 
test  cases  with  the  largest  decrease  in  total  message  traffic  (5  &  7  LPs).  These 
inconsistent  results  provide  a  strong  indication  that  other  factors  are  influencing  the 
simulation  performance. 

As mentioned previously, one potential source of these inconsistencies is a failure by
the partition cost function to capture the true relationship between the distribution of the
communications and the simulation performance. Another possible source of the
inconsistency is the artificial feedback imposed upon the simulation when the behavior
graph is mapped onto the processor graph. Although this feedback is not addressed
directly in this thesis, partitioning the circuit to reduce the number of LP output lines and
inter-LP arcs also reduces the amount of imposed feedback. Intuitively, however, the true
effect of these imposed feedback loops depends upon the behaviors involved and the
signal activity of the circuit. Further research is needed in order to account for this
overhead in the partition cost model.

5.3  Speedup  Prediction 

As discussed in section 3.3.4.2, one of the objectives of this thesis was to quantify the
relationship between the quality of a partition and the resulting speedup. Although the
ability to predict the speedup from the partition information may be useful, the primary
purpose of this objective is to validate the partition cost function developed in this thesis.
The coefficient value β was arbitrarily set to 1.0, and the coefficient value α was selected
to provide the desired relative weightings to the load imbalance and communications
portions of the objective cost function. The first step in selecting α was to examine the
expected magnitudes of the communications cost sub-function (Hn Hc (1 + Hd)) and the
load imbalance sub-function (Hb) for a typical example (e.g., the Wallace tree multiplier).
Because of the methodology selected for representing the communications cost
factors, the communications cost sub-function is likely to produce a much larger value
than the load imbalance sub-function. Therefore, it was desired to make α larger than β in
order to account for this difference in representation. An α value of 100.0 was chosen.
This selection was validated by comparing the resulting expected speedup values against
the actual speedup curves using an arbitrary value for γ. It should be noted that the
relatively small load imbalances used in this thesis (1.5% max) would tend to obscure any
errors with this relative weighting by causing the cost function to always be dominated by
the communications cost sub-function. Further research is required in order to find the
correct relative weightings. This research should include a re-examination of the “best”
way to represent the communications cost factors.

Figure 42. Wallace Tree Speedup Prediction Curves

The last remaining coefficient, γ, was determined to be circuit dependent.
Specifically, it appears to be inversely proportional to the total number of events in the
simulation. The value for γ was selected separately for each circuit by using
trial-and-error to find the value which gave the best match between the predicted and actual
speedup curves for random partitioning. Once selected, the same value was used for each
of the other partitioning algorithms. All three coefficient values are given to GP-Tool,
which calculates the predicted speedup as part of the partition statistics output file.

In general, the speedup predictions were qualitatively correct (though not numerically
exact) for a clear majority of the partition/LP combinations. Intuitively, this seems to
indicate that the objective cost function proposed in this thesis is able to successfully
model the dominating partition cost factors in most circumstances. Further research would
allow this cost model to be refined to provide even better results.

5.3.1 Wallace-Tree Multiplier. The predicted speedup curves for the Wallace tree
multiplier are shown in Figure 42. For this circuit, a γ value of 0.09 was used. Although
the predicted speedup values were not exact, comparison with Figure 36 shows that the
partition cost model successfully predicted the correct ordering of the four partition types.
For example, it correctly predicted that the SBF partitions would perform better than the
SDF partitions which would, in turn, perform better than the random partitions. Except
for the 7 LP case, it also correctly predicted that the AB annealing partitions would
outperform the SBF partitions.
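
The ordering comparison described above can be made mechanical. The following is a minimal sketch only; it is not part of GP-Tool, the function name is hypothetical, and the speedup numbers are placeholders invented purely to exercise the check (they are not measured results).

```python
def same_ordering(predicted, actual):
    """Return True if both mappings rank the partition types identically."""
    def rank(speedups):
        # Partition names sorted from highest to lowest speedup.
        return sorted(speedups, key=speedups.get, reverse=True)
    return rank(predicted) == rank(actual)


# Placeholder values for illustration only (not data from this thesis).
predicted = {"Random": 1.2, "SDF": 1.8, "SBF": 2.1, "AB": 2.4}
actual = {"Random": 1.0, "SDF": 1.5, "SBF": 2.3, "AB": 2.6}
ordering_matches = same_ordering(predicted, actual)
```

A check of this kind only tests relative ordering; it says nothing about how close the predicted values are to the actual ones.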

5.3.2 Associative Memory Array. The predicted speedup curves for the
associative memory array are shown in Figure 43. For this circuit, a γ value of 0.025 was
used. Again, the predicted speedup values were not exact, but comparison with Figure 39
shows that the partition cost model successfully predicted the correct ordering of the four
partition types. For example, it successfully predicted that the SBF partition would
consistently perform worse than the random partition, and that the SDF partition would
consistently perform better than the random partition. However, it failed to predict the
anomalous 5 and 7 LP cases where the AB annealing partitions performed no better than
the SDF partitions. This further supports the assertion that under some circumstances,
there are factors which contribute to the simulation performance other than those that are
captured by the objective cost function model.

5.4 Message Traffic Analysis

This section looks more closely at the inter-LP message traffic overhead for a few
representative test cases. Three types of graphs are presented: real messages transmitted
from each LP vs. null messages transmitted from each LP; total messages transmitted
from each LP vs. total inter-LP arcs originating from each LP; and total messages
transmitted from each LP vs. total LP output lines originating from each LP. In each
graph, all four partition types are compared. Similar graphs for additional test cases are
included in Appendix D.

5.4.1 Wallace-Tree Multiplier. Figure 44 shows the real vs. null message graph for
the 4 LP case, while Figure 45 shows the same graph for the 8 LP case. Clearly, for the 4
LP random partition, the communications overhead is nearly evenly divided between real
messages and null messages. In addition, the communications are evenly distributed, with
each LP generating a roughly equal number of messages. Although the total number of
null messages is reduced for each successive partition (SDF, SBF, and AB annealing
respectively), the total number of real messages is also reduced. The resulting effect is
that the inter-LP communications for each LP are dominated by the null message
overhead. Figure 45 shows an identical set of relationships for the 8 LP case, except that
the null messages dominate the communications overhead for the random partition as
well. Intuitively, further reductions in real messages without significant reductions in null
messages will have limited impact on the total communications overhead.

An additional observation from Figures 44 and 45 is that for the deliberate
partitioning strategies of SDF, SBF, and AB annealing, the remaining message traffic is
no longer evenly distributed among all of the LPs. Observation of this phenomenon led
to the addition of the communications distribution factor (Hd) to the objective cost
function as discussed in section 3.3.2.2. Figure 46 attempts to validate the
communications distribution factor by showing the relationship between the total number
of messages transmitted from each LP and the total number of inter-LP arcs originating
from each LP.

In  general,  the  results  are  decidedly  mixed.  There  appears  to  be  a  detectable 
relationship  between  output  arcs  and  messages  transmitted  for  the  random  and  SBF 
partitions,  but  not  the  SDF  or  AB  annealing  partitions.  Additional  examples  with 
similarly  mixed  results  are  included  in  Appendix  D.  Collectively,  this  data  supports  the 
assertion  that  the  distribution  of  the  communications  load  may  be  a  factor  in  the 
simulation  performance,  but  the  cost  function  proposed  in  this  thesis  fails  to  accurately 
model  the  communications  bottleneck  in  some  instances.

Figure 44. Wallace Tree 4 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 45. Wallace Tree 8 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 46. Wallace Tree 8 LP Total Messages Sent vs. Output Arcs


One attempt at improving the method of modeling the distribution of the inter-LP
communications involved the relationship between the total messages transmitted from
each LP and the corresponding number of output lines for that LP. This relationship was
investigated because, for relatively large LP values, the null messages
dominate the total message counts, and the number of LP output lines is the major
contributor to the number of null messages. Figure 47 shows this relationship for the
Wallace tree 8 LP case. Clearly, there is a detectable relationship for the majority of the
partition types. A single exception is LP7 for the AB annealing partition. This apparent
anomaly is explainable, however, by the fact that LP7 has no input arcs. With no input
arcs, it never has to block for input. Because it never blocks, it never sends blocking
nulls. Additional examples with similar results are included in Appendix D.

However, despite these positive results, revising the communications distribution
factor Hd to be based upon the number of LP output lines rather than the number of
output arcs resulted in a degradation of the speedup prediction curve results.
Nevertheless, these results indicate that this relationship deserves further investigation.

Figure 47. Wallace Tree 8 LP Total Messages Sent vs. LP Output Lines


5.4.2 Associative Memory Array. A similar set of graphs is presented for the
associative memory array in Figures 48 through 51. Specifically, Figure 48 shows the real
vs. null message graph for the 8 LP case. Notice that for the 8 LP random partition, real
messages continue to dominate the total inter-LP message overhead. However, as
mentioned in section 5.2.2, this situation will reverse itself as the number of LPs is
increased further. Another interesting item from this figure is the result for the SBF
partition. Notice that for a few LPs, the communications overhead continues to be
dominated by real messages. Furthermore, the number of real messages transmitted by
LP5 and LP7 each exceeds the average number of real messages transmitted by all LPs in
the random partition by as much as 109%. This communications bottleneck is the most
likely culprit in the poor performance of the SBF partitions for the associative memory
array.

As stated previously, the 6 LP AB annealing partition provided the maximum speedup
for the associative memory circuit. Figures 49 through 51 show the 6 LP case of the real
vs. null messages graph, the total messages vs. output arcs graph, and the total messages
vs. LP output lines graph respectively. The results are similar to those for the Wallace
tree. An exception is the total messages transmitted vs. the LP output lines for the SBF
partition in Figure 51. In this particular instance, there is no discernible relationship
between the total number of messages transmitted by each LP and the number of output
lines originating from each LP. Additional examples with similar results are included in
Appendix D.

Figure 48. Associative Memory 8 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 49. Associative Memory 6 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 50. Associative Memory 6 LP Total Messages Sent vs. Output Arcs

Figure 51. Associative Memory 6 LP Total Messages Sent vs. LP Output Lines


Table 2. Predicted vs. Actual Effect of Increased Lookahead

5.4.3 Increasing Lookahead. Sections 3.3.2.3 and 3.3.2.4 discuss the
relationship  between  the  average  lookahead  in  the  ipx.arcs  file,  the  null  message 
overhead,  and  overall  simulation  performance.  This  section  uses  the  Wallace  tree  SBF 
partitions  to  present  a  quantitative  example  to  illustrate  these  relationships.  To  make  the 
comparison,  the  SBF  ipx.arcs  files  were  modified  to  contain  the  Wallace  tree  normal 
lookahead  value  of  2  ns  for  all  LP  output  lines.  The  simulations  were  then  re-run  for 
comparison  with  the  original  SBF  results  with  the  increased  lookahead. 

In section 3.3.2.4, an assumption is made that the logical delay value for an LP output
line will only be used to determine the timestamp of a null message approximately 50%
of the time. This assumption is then used as the basis for a modified calculation of Larcs,
which estimates the impact of the average lookahead value on the null message overhead.
Table 2 presents data for the Wallace tree SBF partitions to demonstrate the validity of
this assumption. Specifically, it uses the modified equation for Larcs to calculate the
expected number of null messages and compares it against the actual number of null
messages. The value for expected null messages is calculated by multiplying the value of
Larcs by the number of null messages transmitted when all logical delays in the ipx.arcs
file were set to their normal value (2 ns). As seen from the table, the actual number of
null messages was within 10% of the expected value in a majority of the cases, and was
within 20% in all cases.
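
The comparison just described reduces to one multiplication and one relative-error calculation. The sketch below is illustrative only: the function names are hypothetical, the value of Larcs is taken as an input (its modified equation is given in section 3.3.2.4, not here), and the numbers are placeholders rather than entries from Table 2.

```python
def expected_nulls(l_arcs, baseline_nulls):
    """Expected null count: Larcs times the null count observed when all
    logical delays are set to their normal value."""
    return l_arcs * baseline_nulls


def relative_error(actual, expected):
    """Fractional deviation of the actual count from the expected count."""
    return abs(actual - expected) / expected


# Placeholder numbers for illustration only (not taken from Table 2).
exp_count = expected_nulls(0.8, 10000)
err = relative_error(8600, exp_count)
within_ten_percent = err <= 0.10
```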




Figure 52 shows the comparison between the SBF partitions with the naturally
increased lookahead and the SBF partitions in which all lookahead values have been set
to the normal value of 2 ns. Specifically, it shows the comparison in speedup, average
lookahead, and null message traffic. As can be seen from the figure, the increase in
lookahead had a consistent effect of decreasing the null message overhead and increasing
the speedup. However, the net effect on speedup was essentially negligible, and the
number of null messages is still proportional to the number of LP output lines. Therefore,
while increasing the average lookahead values will help to optimize the simulation
performance, it does not appear to be a potential source of significant speedup gains.

5.4.3.1 Calculating Lookahead. While increasing the average lookahead
reduces the null message overhead, calculating the correct lookahead values is a potential
computational bottleneck. The maximum lookahead value for an LP output line is defined
as the length of the minimum path from any inter-behavior arc entering the LP (or any
source behavior in the LP) to any inter-behavior arc exiting the LP that corresponds to the
given LP output line. The length of a path is defined as the sum of the logical delays of the
behaviors on the path.
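
The definition above can be evaluated by brute-force path enumeration. The following is a minimal sketch under stated assumptions, not GP-Tool code: the function name and the dictionary-based data layout (`succ` for intra-LP successors, `delay` for logical delays, `exits` for the destination LPs reached from a behavior, `entries` for behaviors fed by input arcs or acting as sources) are all hypothetical.

```python
def lookahead(succ, delay, exits, entries):
    """Minimum entry-to-exit path delay for each destination LP.

    Every path through the LP is enumerated recursively, so the cost grows
    with the number of distinct paths, not just the number of behaviors.
    """
    best = {}  # destination LP -> smallest path delay seen so far

    def walk(behavior, length, on_path):
        length += delay[behavior]
        for dest in exits.get(behavior, ()):
            if length < best.get(dest, float("inf")):
                best[dest] = length
        for nxt in succ.get(behavior, ()):
            if nxt not in on_path:  # do not loop on feedback paths
                walk(nxt, length, on_path | {nxt})

    for start in entries:
        walk(start, 0, {start})
    return best


# Tiny hypothetical LP: a feeds b and c, b feeds c; every delay is 2 ns.
succ = {"a": ["b", "c"], "b": ["c"]}
delay = {"a": 2, "b": 2, "c": 2}
exits = {"b": ["LP2"], "c": ["LP1"]}
result = lookahead(succ, delay, exits, ["a"])
```

In this example the path a, c (4 ns) is shorter than a, b, c (6 ns), so the lookahead toward LP1 is 4 ns.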

The current algorithm used by GP-Tool is a recursive algorithm that begins at each
input arc to the current LP (LPa) and traverses all possible paths through the LP until it
reaches an external LP, tracking the length of the current path (in terms of logical
behavior delays) along the way. When the current path reaches an external LP (LPb), the
minimum path on record from LPa to LPb is compared to the length of the current path
and updated if necessary. The process is repeated for each input arc and each source
behavior for each LP. This algorithm has proven to be the most efficient method of
calculating the lookahead values in most situations. Two exceptions are discussed in the
next section.

Figure 52. Effect of Increased Lookahead on Wallace Tree SBF Partitions


5.4.3.2 Lookahead Anomalies. The first anomaly related to the calculation of
the average lookahead values for the ipx.arcs files deals with the running time of the
recursive routine discussed in the previous section. Under certain circumstances, this
algorithm can be extremely inefficient. Specifically, calculating the lookahead values for
the Wallace tree circuit with 2-4 LPs can take 30 minutes or more. This occurs because
the SDF partitioning algorithm results in partitions in which the first LP contains
relatively long paths of dependent behaviors. This, in turn, increases the number of paths
through the LP as well as increasing the level of recursion required by the algorithm.

Comparison with other algorithms, however, indicates that the increased computation
time is out of proportion to the longer paths caused by the SDF partition. For example,
Dijkstra’s shortest path algorithm, which finds the shortest path from a given behavior to
all other behaviors, runs much faster than the anomalous recursive case. The problem was
not experienced on the associative memory circuit, which is more than 4 times larger than
the Wallace tree multiplier. Furthermore, as the number of LPs was increased, the
computation bottleneck for the Wallace tree SDF partitions decreased.

An alternative method for calculating the correct lookahead values based upon
Dijkstra’s shortest path algorithm was implemented as well (although it is not in the
current version of GP-Tool). For each input arc and source behavior in LPa, this approach
uses Dijkstra’s algorithm to calculate the shortest path to all other behaviors in the
system. Not all of these paths are relevant to the lookahead value. For example, the path
from behavior i in LPa to behavior j in LPc is of no interest if there is no direct connection
from LPa to LPc. Therefore, this algorithm involves significant extraneous computations.
Although this algorithm is independent of the number of LPs or the quality of the
partition, an additional disadvantage is that its running time is proportional to N(N-1),
where N is the number of behaviors in the system. In general, this algorithm was less
efficient than the recursive algorithm, except for those few cases in which the recursive
algorithm experienced the anomalous computational bottlenecks.
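
For reference, a single-source pass of Dijkstra's algorithm over a behavior graph might look like the following. This is a generic textbook sketch using a binary heap, not the implementation evaluated in this thesis; the function name and the `succ`/`delay` dictionaries are hypothetical, and path length is taken as the sum of the delays of the behaviors on the path (source included), which assumes non-negative delays.

```python
import heapq


def dijkstra(succ, delay, source):
    """Shortest accumulated delay from `source` to every reachable behavior."""
    dist = {source: delay[source]}
    heap = [(dist[source], source)]
    while heap:
        d, behavior = heapq.heappop(heap)
        if d > dist.get(behavior, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for nxt in succ.get(behavior, ()):
            nd = d + delay[nxt]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


# Tiny hypothetical graph: a feeds b and c, b feeds c; every delay is 2 ns.
succ = {"a": ["b", "c"], "b": ["c"]}
delay = {"a": 2, "b": 2, "c": 2}
dist = dijkstra(succ, delay, "a")
```

Each behavior is settled once regardless of how many distinct paths reach it, which is why this approach does not exhibit the path-count blow-up of exhaustive recursion, at the price of computing distances that the lookahead calculation never uses.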

The second anomaly related to the calculation of the lookahead values for the
ipx.arcs files involved the associative memory 8 LP SBF partition. Specifically, the
recursive algorithm produced lookahead values which caused message out-of-order errors
during simulation. This occurred when the lookahead values in the ipx.arcs file were too
large, causing the transmission of null messages with timestamps that were too large.
Theoretically, calculating the lookahead values as described in the previous section
should prevent this from occurring. Analysis has failed to find any errors in the algorithm,
and the problem was not observed on any other circuit/partition/LP-value combination. It
was manually resolved by reducing the lookahead values in the ipx.arcs file in small
increments until the out-of-order errors disappeared.

5.5 AB Border Annealing Algorithm

One of the primary objectives of this thesis research was to make the implemented
partitioning strategy efficient in terms of required computation time. As discussed in
chapter 4, the annealing algorithm continues until a maximum number of iterations have
been executed, or until a specified number of consecutive iterations have been executed
with no net improvement in the cost function. Under the current implementation of
GP-Tool, there is no option for setting the tolerance for measuring changes in the cost
function. Rather, changes are measured to the precision provided by the standard floating
point data type used in the Ada compiler. As a result, the annealing process will continue
as long as minor improvements are being made.
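
The two stopping conditions, and the effect a tolerance option would have, can be illustrated with a generic loop. This is a sketch under stated assumptions rather than GP-Tool's implementation: `cost` and `improve` are hypothetical callables standing in for the objective cost function and one annealing pass, and the `tol` parameter is the option that the text notes is currently absent.

```python
def anneal(cost, improve, state, max_iters=1000, patience=10, tol=0.0):
    """Iterate until max_iters, or until `patience` consecutive iterations
    fail to lower the best cost by more than `tol`."""
    best = cost(state)
    stale = 0   # consecutive iterations without a counted improvement
    iters = 0
    while iters < max_iters and stale < patience:
        state = improve(state)
        iters += 1
        current = cost(state)
        if best - current > tol:
            best = current
            stale = 0
        else:
            stale += 1
    return state, iters


# Toy example: each pass halves the cost, so every pass is an "improvement"
# at full floating point precision, but not once a tolerance is applied.
_, iters_exact = anneal(lambda x: x, lambda x: x / 2, 1.0, max_iters=50)
_, iters_tol = anneal(lambda x: x, lambda x: x / 2, 1.0, patience=3, tol=0.01)
```

With `tol=0.0` the toy run uses all 50 allowed iterations, while a tolerance of 0.01 stops it after 11, which mirrors the behavior described above where the process continues as long as any minor improvement is being made.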

For example, Figure 53 shows the partition statistics for the 8 LP Wallace tree AB2
annealing partition as they vary over the iterations of the AB border annealing algorithm.
Recall that in this partition, the option ignore_comm_Dist_Factor was set to true. Thus,
Hd is allowed to increase in an effort to maximize the reductions in Hc and Hn. Although
the algorithm executes 168 iterations before terminating, it is clear that minimal
improvement was made after the first 35 iterations. This type of behavior, in which the
majority of the improvement was made in the first few iterations, is typical for the test
cases performed to date.

The AB border annealing algorithm has not been instrumented to allow for detailed
timing measurements. However, the entire AB annealing process for this test case took
approximately 59 sec to complete, for an average of 0.35 sec per iteration. It should be
noted that not all iterations will require the same amount of computational time. For
example, as the state of the partition is improved through successive iterations, fewer
behaviors will be queued for reassignment consideration. Thus, as the algorithm
progresses, the time per iteration will show a decreasing trend.

As a point of comparison, rough measurements were taken on the time required to
produce a random partition and an SDF partition for both the Wallace tree and the
associative memory with 8 LPs. (Note that the random partitioning algorithm was not
written for maximum efficiency.) For the Wallace tree, the random partition took
approximately 7 secs, while the SDF partition took less than 1 sec. For the associative
memory, the random partition took approximately 71 secs, while the SDF partition took
less than 3 secs.

Figure 53. Wallace Tree 8 LP Partition Statistics vs. AB Border Annealing Iterations

5.6 Increasing the Number of Processors

The results presented above were limited to an 8-node iPSC/2 hypercube. In order to
validate the results on a larger number of processors, the Wallace tree circuit was run on
an iPSC/860 using up to 32 nodes. The simulation speedup results, partition statistics, and
inter-LP message counts are presented in Figures 54 to 56 respectively.

Figure 54. iPSC/860 Wallace Tree Speedup Results Comparison


As can be seen from Figure 54, the increased processing power of the i860 processors
provides an order of magnitude speedup over the iPSC/2 for the single LP case. Because
the single LP case runs so much faster on the i860, it is much more difficult to obtain
speedup with multiple processors. For example, the speedup for the random partition was
less than 1.0 for all LP numbers greater than 1, and the maximum speedup obtained was
1.9 on 4 LPs for the AB annealing partition. Despite these differences, the patterns
relating the partition statistics to the inter-LP message traffic and speedup are identical to
those for the iPSC/2 results.

VI. Conclusions and Recommendations


6.1  Research  Summary 

As modern integrated circuit designs grow larger and more complex, the time
required to perform sequential VHDL simulations becomes more burdensome. In order to
execute complex simulations in a reasonable amount of time, a parallel VHDL simulator
should be used to simulate hierarchical structural VHDL circuits. Parallelism is
introduced by partitioning the circuit behaviors among the available processors to form
logical processes (LPs). Signal changes are shared among the LPs by event messages.

Parallelism by itself, however, fails to provide satisfactory speedup results due to the
overhead required to communicate signal changes and maintain synchronization between
LPs. The amount of overhead is directly dependent on how the circuit behaviors are
partitioned among the logical processes.

In this research effort, a circuit behavior inter-dependency structure is extracted from
the first iteration of VSIM’s sequential simulation cycle. This information is used to build
a graph representing the structure of the circuit being simulated with the circuit behaviors
as vertices and their inter-dependencies as directed arcs. Using various graph traversal
techniques to account for the circuit inter-dependency structure, the circuit is divided into
the desired number of LPs. A border annealing algorithm is then employed to refine the
quality of the partition by selectively reassigning behaviors to different LPs.

Two  relatively  large  circuits  (an  8x8  Wallace  tree  multiplier,  and  a  16x16  associative 
memory  array)  have  been  used  as  subjects  on  which  to  test  partitioning  techniques. 
Speedup  results  are  compared  to  those  produced  by  a  random  partitioning  of  the  circuit 
behaviors. 

As an aid in making reassignment decisions during the border annealing process, an
objective cost function is formulated in order to measure the quality of a given partition.
This cost function is designed to account for the additional communications overhead
resulting from the conservative null message PDES synchronization protocol. In addition,
an attempt is made to account for unevenness in the distribution of the communications
overhead. Finally, an attempt is made at quantifying the relationship between the quality
of a partition as measured by the objective cost function and the resulting simulation
performance.

6.2 Conclusions

The  following  general  conclusions  can  be  made  about  partitioning  hierarchically  built 
structural  VHDL  circuit  simulations: 

• Deliberate partitioning schemes improve simulation speedup. The primary
research objective of demonstrating improved speedup over random partitioning
was accomplished. This research has found that, in general, a deliberate
partitioning algorithm which accounts for the complex inter-dependency
relationships of the circuit behaviors will tend to reduce the communications
overhead and improve the simulation performance.

• The partition cost function must account for more than load imbalance and the
number of inter-LP arcs. As stated in the research objectives, an effort was made
to determine a meaningful method of measuring the cost of a partition. Data
analysis shows that as the number of LPs is increased, the null message traffic due
to the conservative PDES synchronization protocol begins to dominate the
inter-processor communications overhead. Reducing the real message traffic by
reducing the inter-behavior arcs which cross LP boundaries serves to enhance the
dominance of the null message overhead. Not accounting for the null message
overhead will give an inaccurate picture of the quality of a given partition.
Therefore, null message synchronization overhead must be accounted for in the
partition cost. In addition, this research indicates that the distribution of the
remaining communications among the LPs also impacts the performance of the
simulation. Additional research is needed in order to determine the exact nature of
this latter relationship.

• Null message overhead is directly proportional to the number of arcs in the LP
connectivity graph. Results have clearly shown that the number of null messages
required to maintain synchronization is directly proportional to the number of arcs
in the LP connectivity graph (referred to as “LP output lines”). By decreasing the
connectivity between LPs using a deliberate partitioning scheme, it is possible to
reduce the null message overhead and improve simulation performance. However,
it appears as though the best method for reducing null message overhead is to
avoid imposed feedback among the LPs (i.e., make the LP connectivity graph
acyclic).

• Further reductions in real message overhead will have negligible impact on
simulation performance. The partitioning algorithms used in this thesis research
have made significant reductions in the amount of real message communications
overhead compared to a random partitioning of the circuit behaviors. The null
message overhead is also reduced, but by a much smaller margin. As a result, the
inter-processor communications overhead is dominated by the null message traffic
for a relatively small number of LPs. The problem is exacerbated as the number of
LPs is increased. Further reductions in real message traffic without significant
reductions in null message traffic will have a negligible impact on the total
inter-processor message traffic.

• The proposed partition cost function provides an accurate means for comparing
the relative quality of different partitions. With few exceptions, it was shown that
by relating the partition cost to the expected simulation speedup, the proposed
partition cost function could correctly predict the relative performance ordering of
the various partitioning schemes used.

•  The  proposed  AB  border  annealing  partitioning  algorithm  provides  an  effective 
means  of  iteratively  improving  a  partition.  Data  analysis  shows  a  consistent 
reduction  in  the  number  of  inter-LP  arcs  and  LP  output  lines  by  using  the  AB 
border  annealing  algorithm.  However,  due  to  the  continued  dominance  of  the  null 
message  overhead,  the  corresponding  simulation  performance  improvement  is 
often  insignificant. 
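
Several of the conclusions above concern the LP connectivity graph, in particular the observation that it should ideally be acyclic. As a minimal sketch (not part of GP-Tool; the function and argument names are hypothetical), whether a candidate partition's LP connectivity graph contains imposed feedback could be checked with a standard topological-sort pass (Kahn's algorithm):

```python
from collections import deque


def is_acyclic(lp_arcs, lps):
    """Return True if the directed LP connectivity graph has no cycle.

    lp_arcs is a collection of (source LP, destination LP) pairs, one per
    LP output line; lps lists every LP identifier.
    """
    indegree = {lp: 0 for lp in lps}
    for _, dst in lp_arcs:
        indegree[dst] += 1
    ready = deque(lp for lp in lps if indegree[lp] == 0)
    removed = 0
    while ready:
        lp = ready.popleft()
        removed += 1
        for src, dst in lp_arcs:
            if src == lp:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    # Any LP never removed lies on, or downstream of, a feedback loop.
    return removed == len(lps)


pipeline_ok = is_acyclic([(0, 1), (1, 2)], [0, 1, 2])
feedback_ok = is_acyclic([(0, 1), (1, 0)], [0, 1])
```

The arc scan inside the loop is quadratic in the worst case, which is adequate for the small LP counts considered here; an adjacency list would be the natural refinement for larger graphs.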

6.3  Recommendations  for  Further  Research 

6.3.1  Circuit  Partitioning  Recommendations.  Significant  progress  has  been  made 
in  this  thesis  research  towards  achieving  improved  simulation  speedup  through  a 
deliberate  circuit  partitioning  strategy.  Some  suggested  areas  of  research  for  expanding 
upon  this  progress  are: 

•  Eliminate  imposed  feedback  among  LPs.  By  producing  an  acyclic  LP 
connectivity  graph,  circular  waiting  among  LPs  will  be  eliminated.  This  will 
reduce the amount of LP blocking as well as the null message overhead. The
suggested methodology is to produce an acyclic initial partition (treating strong
components as indivisible blocks), and to modify the objective cost function so that
LP feedback is not introduced during the border annealing process.

•  Continue exploring the relationship between the distribution of the inter-processor
communications  and  the  simulation  performance.  Data  from  this  research  has 
shown  that  an  uneven  distribution  of  the  inter-processor  communications  may 
have  a  negative  impact  on  simulation  performance.  The  exact  nature  of  this 
relationship is still undefined. It is possible that the elimination of feedback
among  LPs  may  negate  the  effects  of  uneven  communications  distribution. 

6.3.2  Parallel Simulation Recommendations. There remains significant work to be
done in the future in terms of the parallel VHDL simulation application, VSIM, and
synchronization protocol. Some specific suggestions for further work are:

•  Implement  a  more  optimistic  PDES  synchronization  protocol.  While  it  may  be 
possible  to  gain  additional  speedup  by  modifying  the  rules  for  sending  null 
messages,  it  is  unlikely  that  such  gains  will  be  significant.  One  recommended 
approach involves the use of “local rollbacks” (11:22). Under this approach, each
LP  maintains  a  safe-time  as  in  the  current  null  message  approach,  but  is  allowed 
to  process  events  as  fast  as  possible,  potentially  advancing  its  local  simulation 
clock  past  its  safe-time.  However,  real  messages  with  timestamps  larger  than  the 
safe-time  are  not  sent,  but  are  buffered  until  it  is  safe  to  transmit  them.  Although 
state  saving  is  required  in  this  scheme,  it  has  the  advantage  that  rollbacks  to  prior 
states  are  local  to  the  LP  receiving  an  out-of-order  message.  There  is  no  need  for 
anti-messages to counteract prematurely transmitted real messages. Under this
scheme,  the  only  messages  in  the  simulation  will  be  real  messages  which  transmit 
actual  signal  change  information.  If  desired,  a  limit  can  be  placed  on  how  far  past 
the  safe-time  an  LP  is  allowed  to  advance.  This  time  window  will  limit  the 
amount  of  state  saving  overhead,  but  must  be  chosen  large  enough  to  prevent 
circular  waiting  in  feedback  loops  among  the  LPs.  Numerous  alternative 
synchronization  protocols  are  possible  as  well  (e.g.,  conservative  time  windows, 
time warp, lazy cancellation, optimistic time windows, etc. (11)).

•  Expand  the  VHDL  subset  supported  by  VSIM.  Currently,  VSIM  supports  a  very 
limited  subset  of  the  standard  VHDL  language.  This  has  the  effect  of  limiting  the 
number  of  circuits  that  can  be  built  and  simulated  in  parallel  with  a  reasonable 
expenditure  of  programming  resources.  Breeden  suggests  methods  for 
implementing two key enhancements: resolution functions and wait statements
(4:83-84).  Other  suggested  enhancements  include  complex  procedures,  bit 
vectors,  and  multi-valued  logic  (MVL)  data  types. 

•  Improve  the  basic  VSIM  simulation  cycle.  The  basic  simulation  cycle 
implemented  in  VSIM  suffers  from  large  overheads  and  poor  performance. 
Current state-of-the-art sequential VHDL simulators available commercially can
simulate a circuit many times faster than the sequential version of VSIM. One
potential  source  of  improvement  is  with  improved  list  management.  Detailed 
instrumentation  of  the  simulation  cycle  may  lead  to  the  discovery  of  other  sources 
of  potential  improvement. 

•  Improve  VSIM's  postprocessor.  Breeden  recommends  several  options  for 
improving  the  postprocessor  which  transforms  the  sequential  Intermetrics  code 
into  models  compatible  with  VSIM  (4:83). 

•  Implement selective output report generation. Currently, the only option for
generating  simulation  output  in  VSIM  is  to  report  all  signal  changes  in  the 
simulation  one  at  a  time.  For  large  circuits,  verifying  the  correctness  of  such 
output  files  is  infeasible.  As  a  minimum,  an  ability  should  be  provided  to  allow 
selection  of  which  signals  to  include  in  the  output  report.  Preferably,  formatted 
output  files  such  as  those  produced  by  Intermetrics  should  be  added. 
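
The “local rollbacks” recommendation above can be illustrated with a minimal sketch. It shows only the send-side rule: an LP may work past its safe-time, but real messages stamped later than the safe-time are held locally until the safe-time catches up. The names LocalRollbackLP, emit, advance_safe_time, rollback, and may_advance_to are hypothetical, state saving is omitted, and the rollback rule is deliberately simplified; this is not VSIM or SPECTRUM code.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Message:
    timestamp: int                       # only the timestamp is used for ordering
    signal: str = field(compare=False)
    value: int = field(compare=False)


class LocalRollbackLP:
    """Hypothetical LP that buffers real messages stamped past its safe-time."""

    def __init__(self, safe_time=0, window=None):
        self.safe_time = safe_time   # latest time known to be safe
        self.window = window         # optional limit on how far past safe-time to run
        self.pending = []            # min-heap of buffered (unsent) messages
        self.sent = []               # messages actually transmitted

    def may_advance_to(self, t):
        """True if the local clock is allowed to reach time t."""
        return self.window is None or t <= self.safe_time + self.window

    def emit(self, msg):
        """Send msg now if it is safe, otherwise hold it locally."""
        if msg.timestamp <= self.safe_time:
            self.sent.append(msg)
        else:
            heapq.heappush(self.pending, msg)

    def advance_safe_time(self, new_safe_time):
        """Raise the safe-time and release every message that became safe."""
        self.safe_time = max(self.safe_time, new_safe_time)
        while self.pending and self.pending[0].timestamp <= self.safe_time:
            self.sent.append(heapq.heappop(self.pending))

    def rollback(self, t):
        """Discard speculative output after time t (simplified rule).

        Buffered messages never left this LP, so they are simply dropped;
        no anti-messages are needed.
        """
        self.pending = [m for m in self.pending if m.timestamp <= t]
        heapq.heapify(self.pending)
```

Because unsafe messages never leave the LP, a rollback only has to touch local data, which is the property the recommendation relies on. The window parameter corresponds to the optional limit on how far past the safe-time an LP may advance.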

Appendix A.  Acronyms and Definitions


A.1  Glossary of Acronyms


AFIT      -  Air Force Institute of Technology
ARPA      -  Advanced Research Projects Agency
IVAN      -  Intermediate VHDL Attributed Notation
LP        -  Logical Process
MFA       -  Mean Field Annealing
PDES      -  Parallel Discrete Event Simulation
SA        -  Simulated Annealing
SBF       -  Simple Breadth-First partitioning
SDF       -  Simple Depth-First partitioning
SDP       -  Simple Date Partitioning
SGE       -  Synopsis Graphic Editor
SPECTRUM  -  Simulation Protocol Evaluation on a Concurrent Testbed using
             Reusable Modules
TIG       -  Task Interaction Graph
TPG       -  Task Precedence Graph
VHDL      -  VHSIC Hardware Description Language
VHSIC     -  Very High Speed Integrated Circuit
VLSI      -  Very Large Scale Integrated


A.2  Definitions

Activity - The state of an entity over an interval of time (18:135). For example, the
activity of a signal is defined as the sequence of state changes for that signal over a given
time period.

Behavior - In VHDL, a behavior is an executable process representing a logic gate,
input signal, output signal, or other simple VHDL process.

Design Hierarchy - In VHDL, the design hierarchy represents the successive
decomposition of a design entity into components, binding those components to other
design entities that may be decomposed in a similar manner. Collectively, they represent
a complete design and are referred to as a design hierarchy (8:2-11).

Event - An activity that causes a change in the state of the simulation model (11). In
the context of this thesis, a simulation event is defined as the changing of a signal value
from one state to another.

Message - A message is the mechanism used by processes to communicate the
modified state information caused by a simulation event. In this thesis, the term message
implies communications between logical processes, and thus, corresponds to
inter-processor communications.

Model - An abstract representation of a physical system (1). A model consists of
entities and their inter-relationships (18:135).

Process - A succession of entity states over a contiguous time period (18:136). A
logical process (LP) is the model's representation of a physical process (PP) in the system
(7:198-199).

Signal - In VHDL, a signal represents an object that holds a value and corresponds
directly to a metal interconnection within a circuit (8:2-12). Signals define the pathways
among VHDL processes (i.e. behaviors) (15:9).

State - The sum of all variables describing an entity at a given instant in time
(18:135).

System - A real-world process being modeled and simulated (1).

Appendix B.  AFIT Parallel VHDL Simulation User's Guide


B.1  Overview.

To execute a parallel VHDL simulation, the VHDL circuit is first compiled with the
Intermetrics VHDL simulator. The intermediate C code is then intercepted and
transformed into models compatible with AFIT’s parallel VHDL simulator - VSIM.
VSIM has a sequential mode that can be executed on a single processor system, and a
parallel mode that runs on the Intel iPSC/2 and iPSC/860 Hypercubes (4). This appendix
discusses the general process for successfully executing a parallel VHDL simulation
using VSIM and then illustrates this process through a step-by-step example.


B.1.1  Required Files. Figure 57 shows the location of the baselined versions of all
VSIM related files other than the VHDL circuit specific files. The files identified in bold
are required to compile and execute the parallel version of VSIM. The files attached by
dashed lines are executable utility routines used in the code transformation process. These
routines are actually run on the SPARC, but are archived on the iPSC/2 along with the
rest of the VSIM related files.

With the exception of the file application.h, all of the source files listed in Figure
57 can be compiled out of the archive directories. The file application.h must be
modified to identify the desired number of LPs in the simulation, and should be copied to
and compiled from the user’s local directory.

A brief description of each source file is listed below (4:91-92).

•  vsim.h - Header file for vinit.c, vsim.c, vtools.c, and vspec.c;
modeled after Intermetrics’ simuti.h.

•  vsim.c - VSIM main simulation loop and associated functions.


[Figure 57: directory tree of the VSIM archive on CUBE386. Directory names shown
include /usr, /simulate, /spectrum, /afit, /filters, /vsim, /pbuild, /mapping, and
/include; files shown include vhdlclocks.c, vinit.c, application.h, pbuild, vmap,
lp_man.c, cube2.c, host2.c, cube2.h, vsim.c, vsim.h, vspec.c, vtools.c, globals.h, and
plex (+ source files).]

Figure 57.  Location of Archived VSIM files on AFIT’s iPSC/2 Hypercube

•  vinit.c - VSIM initialization routines.

•  vtools.c - Functions for printing VSIM state variables and queues to assist in
code maintenance (compilation is optional).

•  vspec.c - Functions that provide the interface between VSIM and SPECTRUM.

•  globals.h - Standard SPECTRUM header file. Modified to redefine the event
structure.

•  application.h - Contains application-specific global information for SPECTRUM
and vspec.c, and is included by globals.h. Specifies the number of LPs for a
particular simulation run.

•  lp_man.c - Contains SPECTRUM’s LP-level functions.

•  cube2.c - Provides interface between lp_man.c and the operating system.

•  cube2.h - Header file for cube2.c and host2.c.

•  host2.c - Host program which loads the nodes and starts the simulation.


B.1.2  Process. There are seven basic steps involved in the running of a parallel
VHDL  simulation  with  VSIM  (4:90): 

1.  Develop  the  original  VHDL  source  code  describing  the  circuit  to  be  simulated  and 
the  testbench  to  be  used  to  verify  the  circuit  design.  The  VHDL  source  code  must 
comply  with  the  subset  of  VHDL  supported  by  VSIM  as  described  by  Breeden 
(4). 

2.  Perform  the  Compile,  Model  Generate,  and  Build  phases  of  the  Intermetrics 
sequential  simulation. 

3.  Using VSIM’s postprocessor, pbuild, transform the Intermetrics generated source
files into VSIM compatible source code.

4.  Compile and run the sequential version of VSIM in order to define the behavior
dependency relationships.

5.  Using the VSIM utility vmap, extract the behavior dependency relationships from
the output of the sequential VSIM simulation and generate a .vmap output file.

6.  Using the VSIM Graph-Partitioning Tool (GP-Tool), read in the .vmap file
generated in the previous step and generate a logical process dependency file
(lpx.arcs) and a behavior-to-LP mapping file (lpx.map).

7.  Compile  and  run  the  parallel  version  of  VSIM  on  the  Hypercube. 

The  remainder  of  this  appendix  discusses  these  seven  steps  in  more  detail  and 
concludes  with  a  step-by-step  example. 

B.2  Generating the VHDL Source Files.


B.2.1  Generating the VHDL Source Code. Step one in the VSIM simulation
process is to create the VHDL source code describing the circuit to be simulated. VSIM
can only support structural circuit descriptions and simple VHDL processes. Specific
limitations are (4:93):

•  Signals can only be of type bit or bit_vector.

•  Bit-vector signal inputs must be described one bit at a time (e.g., A(1) <= '0'
after 3 ns).

•  In general, processes must be one line descriptions (e.g., out <= in1 xor in2
after delay). However, a multi-line process (delimited by begin and end
process) may be used if it: 1) waits on all signals, or 2) terminates after the first
use.

•  Support for VHDL functions and procedures has not been implemented in VSIM.
This means that multi-valued logic (MVL) signals and bus resolution functions
are not supported. Refer to (4:93) for more information.


B.2.2  Establishing an Intermetrics User Library. Step two in the VSIM simulation
process involves use of the Intermetrics commercial simulator to compile the VHDL
source code and create the sequential simulation models. The Intermetrics compiler is
located on vulcan in the parallel simulation laboratory. Before establishing a personal
library, the environment variables shown in Figure 58 must be included in the user’s
.cshrc file.

The next step is to create the user’s individual work library using the sequence of
commands shown in Figure 59. These commands must be executed on vulcan,
substituting the user’s own id for “kkapp” (user entries in bold).

The user’s work library will only need to be created one time. Once created for the
first circuit, this step may be skipped for future circuit simulations.


B.2.3  Compiling, Model Generating, and Building. After a user library has been
created, the next step is to compile the VHDL source code, creating an IVAN file as
output. The intermediate C source code required by VSIM is obtained by running the
Intermetrics model generate routine on the IVAN file. Finally, running the Intermetrics
build routine on the intermediate source files creates the required compilation script.


# the following lines are for intermetrics vhdl
setenv VHDL_TREE /usr6/vhdl_restore/inter_vhdl/v2.1
setenv VHDL_COMMON /usr6/vhdl_restore/inter_vhdl/v2.1/common
setenv VHDL_HELP_FILE /usr6/vhdl_restore/inter_vhdl/v2.1/common/help.txt
setenv VHDL_LIBROOT /usr6/vhdl_restore/inter_vhdl/v2.1/shiplib
setenv VHDL_LIBSIM /usr6/vhdl_restore/inter_vhdl/v2.1/src/simcore/libsim.a
setenv VHDL_BIN /usr6/vhdl_restore/inter_vhdl/v2.1/bin
set path = ($path /usr6/vhdl_restore/inter_vhdl/v2.1/bin)

Figure 58.  User .cshrc Setup for Running Intermetrics

vulcan:~> vls
Standard VHDL 1076 Support Environment Version 2.1-1 September 1990
Copyright (C) 1990 Intermetrics, Inc. All rights reserved.
VLS> makelib -dir=/usr6/vhdl_restore/inter_vhdl/v2.1/shiplib/kkapp «kkapp»
VHDVLS-I-CREATED_LIB - Library «KKAPP» successfully created.
VLS> define work «kkapp»
VLS> setlib «kkapp»
VHDVLS-I-DEFAULT_LIBRARY - Default Library is «KKAPP».
VLS> dir
VHDVLS-I-NO_UNITS - No units found in «KKAPP».
VLS> exit
vulcan:~>

Figure 59.  Setting up User Work Library for Intermetrics

The specific steps are listed below.

•  Each  VHDL  source  code  file  must  be  individually  compiled  with  the  Intermetrics 
vhdl  command  (e.g.,  vhdl  or_gate.vhd). 

•  Each entity/architecture pair must be “model generated” using the Intermetrics mg
command with the -debug-cknd debug switch (e.g., mg -debug-cknd
or_gate(single)).

•  The top-level configuration is built using the Intermetrics build command (e.g.,
build '-debug-cknd -replace -ker-assoc_mem assoc_mem_config').


B.2.4  Code Transformation. The third step in the VSIM simulation process
involves the transformation of the intermediate C source code created by Intermetrics
into models that are compatible with VSIM. The inputs for this phase are the intermediate
.c and .h files created during the model generate phase, and the compilation script created
during the build phase. The compilation script enables the VSIM postprocessor to
determine the required files and their correct order of compilation.

VSIM’s postprocessor, pbuild, is invoked as follows:

pbuild script circuit.c

where script is the name of the compilation script created during the build phase (also
referred to as the “Kernel com” file), and circuit is the name of the circuit being
simulated.

The postprocessor, pbuild, works by concatenating each of the intermediate .c files
into a single file named “big_circuit.c” and calling VSIM’s lexical analyzer, plex, to
perform the transformation process, creating the output file circuit.c.

The final step in the code transformation is to copy all of the required header files to
the same directory as the circuit.c file. These header files will also have to be
transferred to the target parallel machine with the circuit.c file before executing the
parallel simulation. The header files can be found in the user’s work directory, unless
the user has compiled them into another directory using commands in the VHDL source
code.

B.2.5  Transforming Large Circuit Files. A known problem with VSIM’s
postprocessor is that the circuit.c file resulting from the transformation of large circuit
simulations may be too large to compile on the Intel Hypercubes. There are two general
methods for getting around this problem (4:95):

•  Run plex on each individual intermediate C source code file created during the
model generate phase. Compile the resulting output into separate object files and
link them together to execute VSIM on the hypercubes.

•  Construct the original VHDL circuit description using hierarchical configurations.
This results in a significant reduction in the size of the corresponding circuit.c
file. However, the postprocessor is currently unable to properly process the
#include directives necessary for the compilation of the circuit.c file. These
must be manually inserted into the circuit.c file, and can be found by an
examination of the “big_circuit.c” file created by the postprocessor. Both the
wallace-tree and the associative-memory were created in this manner.

Refer  to  (4:95-97)  for  more  detailed  information. 


B.3  Running Sequential VSIM.

Both the Intermetrics simulator and the VSIM simulator assign each behavior a
number at run time. Therefore, before mapping the behaviors onto the parallel
architecture, VSIM must be run in sequential mode in order to determine the numbering
of the behaviors and the behavior inter-dependency relationships. This is accomplished
by using the following compile options in the makefile:

-DSPARC -DMAPPING -DOUTPUT

The SPARC option specifies that the simulation is to be sequential. Note that in the
sequential version of VSIM, the SPECTRUM files (lp_man.c, cube2.c, cube2.h,
host2.c, and vhdlclocks.c) are not required.

The MAPPING option specifies that the behavior dependency relationships are to be
reported to an output file. When defined, the MAPPING option is automatically turned off
after the simulation time advances past 0.

The OUTPUT option specifies that signal changes are also to be reported to the output
file. Output is not required. However, when running a circuit for the first time, it is useful
to have output reported for comparison with the Intermetrics output. Note that VSIM does
not have a method for selectively deciding which signals to include in the output file - it
is an all or nothing option.

Theoretically, if OUTPUT is not defined, the sequential simulation can be terminated
after simulation time 0 when all of the required dependency information has been
reported to the output file. However, there is currently no means implemented for
accomplishing this.

B.4  Running Parallel VSIM.


B.4.1  Generating the Partition. Before running the simulation in parallel, the
circuit behaviors must be divided into logical processes (LPs), each of which will be
assigned to a different processor. Using the output of the sequential simulation which was
created with MAPPING defined, the VSIM utility program vmap is used to build a file
defining the interdependency relationships of the behaviors in the circuit. For example,
given the sequential output file circuit.out, vmap is invoked as follows:

vmap circuit.out circuit.vmap

where circuit.vmap is the vmap output file with the following format:

behav_id behav_name behav_delay [optional list of dependencies]

An example vmap output file is shown in Figure 60. If the OUTPUT option was defined,
the script sgrep can be used to sort the output by time and signal name as follows:

sgrep circuit.out output

The sorted output data will be in file output, and can be compared to the Intermetrics
simulation output in order to verify correctness.

The vmap output file is used as input to the VHDL Graph-Partitioning Tool (GP-Tool)
which builds a directed graph from the behavior dependency information in the file.
GP-Tool is a menu-driven utility program which allows the user to select from a variety of
partitioning algorithms. Refer to the GP-Tool User’s Guide for detailed instructions on
using this utility.

Of the numerous output files created by GP-Tool, the most important are the two files
that are required for the parallel execution of VSIM. These are the logical process
dependency file (lpx.arcs) and a behavior-to-LP mapping file (lpx.map). The user will
be prompted to enter these file names, and should enter them exactly as shown here, with
the “x” replaced by a numeric value specifying the number of LPs in the partition (e.g.,
lp8.arcs and lp8.map).

The specification for the lpx.arcs file is shown in Figure 61. The lpx.map file is a
text file containing two columns of numbers. The first column lists each behavior id
number, while the second column lists the corresponding LP number to which the
behavior is assigned. Examples of an lpx.arcs file and an lpx.map file are shown in
Figures 62 and 63 respectively. Both of these files are read in at run time, and must be in
the same directory as the VSIM application.


9  ET_DFF_TEST_BENCH(STRUCTURAL)      0        1 2
8  ET_DFF_TEST_BENCH(STRUCTURAL)      0        3
7  ET_DFF(STRUCTURAL)                 0
6  ET_DFF(STRUCTURAL)                 0
5  NAND_GATE(SIMPLE)                  3000000  4 7
4  NAND_GATE(SIMPLE)                  3000000  5 6
3  NAND_GATE(SIMPLE)                  3000000  0 2
2  THREE_INPUT_NAND_GATE(SIMPLE)      3000000  3 5
1  NAND_GATE(SIMPLE)                  3000000  0 2 4
0  NAND_GATE(SIMPLE)                  3000000  1

Figure  60.  Example  VMAP  Output  File 
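
A minimal sketch of reading a file in this format follows. It assumes whitespace-separated fields and a behavior name of the form NAME(ARCHITECTURE); the function name parse_vmap and the sample lines are illustrative only and are not part of VSIM or vmap.

```python
import re

# behav_id  behav_name(architecture)  behav_delay  [dependencies...]
VMAP_LINE = re.compile(
    r"^\s*(\d+)\s+(\w+\s*\(\s*\w+\s*\))\s+(\d+)((?:\s+\d+)*)\s*$"
)


def parse_vmap(text):
    """Return {behav_id: (name, delay, [dependency ids])} for vmap-style text."""
    behaviors = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        match = VMAP_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized vmap line: {line!r}")
        behav_id, name, delay, deps = match.groups()
        name = re.sub(r"\s+", "", name)  # tolerate stray spaces in the name
        behaviors[int(behav_id)] = (
            name,
            int(delay),
            [int(d) for d in deps.split()],
        )
    return behaviors


sample = """\
5 NAND_GATE(SIMPLE) 3000000 4 7
4 NAND_GATE(SIMPLE) 3000000 5 6
7 ET_DFF(STRUCTURAL) 0
"""
behaviors = parse_vmap(sample)
print(behaviors[5])
```

The dependency list is optional, so a behavior such as ET_DFF(STRUCTURAL) simply yields an empty list.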
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ION>N>K>H‘tOOOI->N>0 


#  LP  index 

#  Number  of  input  LPs 

2  #  LP  indices  of  input  LPs 

0  #  Polling  frequencies  of  input  LPs 

0  #  Offset  of  polling  frequency 

#  Number  of  input  lines 

2  #  LP  number  for  each  input  line 

#  Number  of  output  LPs 

3  #  LP  indices  of  output  LPs 

#  Number  of  output  lines 

3  #  LP  index  for  each  output  line 

3000000  5000000  #  Minimum  delays  for  each  output  line 

Figure  61.  Fcvmat  Specification  for  ipx.  a  res  Files  (4:98) 
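
The fields named in Figure 61 can be gathered into one record per LP, as in the sketch below. The class name LPArcRecord, the field names, and the four consistency rules are assumptions inferred from the comment text in the figure (for example, that there is one minimum delay per output line); they are not taken from the VSIM or SPECTRUM source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LPArcRecord:
    """One LP's entry; field names paraphrase the comments in Figure 61."""
    lp_index: int
    input_lps: List[int]            # LP indices of input LPs
    polling_frequencies: List[int]  # assumed: one value per input LP
    polling_offset: int             # offset of polling frequency
    input_line_lps: List[int]       # source LP for each input line
    output_lps: List[int]           # LP indices of output LPs
    output_line_lps: List[int]      # destination LP for each output line
    min_delays: List[int]           # minimum delay for each output line

    def problems(self):
        """Return a list of consistency problems (empty if none were found)."""
        found = []
        if len(self.polling_frequencies) != len(self.input_lps):
            found.append("one polling frequency expected per input LP")
        if len(self.min_delays) != len(self.output_line_lps):
            found.append("one minimum delay expected per output line")
        if not set(self.input_line_lps) <= set(self.input_lps):
            found.append("input line from an LP not listed as an input LP")
        if not set(self.output_line_lps) <= set(self.output_lps):
            found.append("output line to an LP not listed as an output LP")
        return found
```

The counts in the figure (number of input LPs, input lines, output LPs, and output lines) are not stored separately here because they follow from the list lengths.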

Figure 62.  Example lpx.arcs File with 3 LPs


6  0 
5  0 
4  0 
8  1 
3  1 
7  1 
9  2 
1  2 
0  2 
2  2 

Figure 63.  Example lpx.map File with 3 LPs
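
To show how a behavior-to-LP mapping determines the inter-LP arcs, the sketch below combines the mapping of Figure 63 with the dependency lists transcribed from Figure 60 (each dependency read as a single-digit behavior id, which is an assumption about the scanned figure). The function names are hypothetical, and the direction of signal flow along a dependency is not assumed.

```python
from collections import Counter


def parse_lp_map(text):
    """Parse two-column 'behavior_id lp_number' text into a dict."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            behav, lp = line.split()
            mapping[int(behav)] = int(lp)
    return mapping


def inter_lp_arcs(dependencies, mapping):
    """Count dependency pairs whose two behaviors sit on different LPs.

    The result is keyed by (LP of the behavior, LP of its dependency).
    """
    arcs = Counter()
    for behav, deps in dependencies.items():
        for dep in deps:
            if mapping[behav] != mapping[dep]:
                arcs[(mapping[behav], mapping[dep])] += 1
    return arcs


# Dependency lists as transcribed from Figure 60 (an assumption, see above).
deps = {9: [1, 2], 8: [3], 7: [], 6: [], 5: [4, 7],
        4: [5, 6], 3: [0, 2], 2: [3, 5], 1: [0, 2, 4], 0: [1]}
lp_map = parse_lp_map("6 0\n5 0\n4 0\n8 1\n3 1\n7 1\n9 2\n1 2\n0 2\n2 2\n")
arcs = inter_lp_arcs(deps, lp_map)
print(sum(arcs.values()), dict(arcs))
```

A count of this kind is what a partition cost function would try to reduce, since every crossing pair implies inter-processor message traffic.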


B.4.2  Execute Parallel Simulation on the Hypercube. The final step in the VSIM
parallel simulation process is to copy the necessary files to the target platform, compile
the simulation, and execute it on the desired number of nodes. The following files are
needed on the hypercube:

•  The circuit specific C source code (i.e. the “circuit.c” file).

•  The header files associated with the “circuit.c” file; these header files have file
names of the form FN####, where #### is a numeric value determined by the
Intermetrics toolset.

•  The appropriate lpx.arcs and lpx.map files for the desired circuit partition.

•  A makefile to compile the appropriate VSIM, SPECTRUM, and circuit specific
files.

In addition, the header file application.h will have to be modified to define the desired
number of LPs. Thus, application.h will have to be copied to, and compiled out of, the
user’s local directory. It is a good idea to also copy the file globals.h to the same local
directory. If application.h is placed in a directory ~/spectrum in the user’s main
directory, the VSIM utility setlps can be used to set the number of LPs without
requiring the user to manually edit application.h. It is invoked as follows:

setlps #

where # is the number of LPs desired. However, as currently implemented, setlps will
only work if the number of LPs is ≤ 9.

The simulation is now ready for compilation on the hypercube. Note that the VSIM
code will have to be recompiled each time the number of LPs is changed in
application.h, but the circuit specific code needs to be compiled only once. The
makefile should handle this automatically. Once compiled for a given number of LPs,
however, the simulation may be run with different partitions by replacing the lpx.arcs
and lpx.map files with no need to recompile.

The  simulation  should  be  compiled  with  the  following  options: 


-UMAPPING -UOUTPUT -DCOUNTS -UMONITORCUBE

The MAPPING option is undefined because the behavior inter-dependency relationships
are already known. The OUTPUT option is also undefined because the simulation output
creates a performance bottleneck. The AFIT research surrounding VSIM has concentrated
on computational speedup and has assumed that other research will adequately address
the problems caused by large amounts of output data in parallel simulation applications.
However, the OUTPUT option can be defined if the user desires to verify the results of the
parallel simulation.

When defined, the COUNTS option causes VSIM to report real, null, and total message
counts to the LP’s “log” in addition to the normal timing information. If the MONITORCUBE
option is defined, each LP will periodically report its simulation time to the terminal
screen so that the user can verify that the simulation is progressing. Other options
available are the DEBUG and REPORT options which will report very large amounts of data
to each LP’s “log” file.

After the compilation is complete, the simulation can be started by invoking the host
program. The user will then be prompted for the name of the circuit program to load, the
command line parameters (simulation end time in ns), the number of nodes to use, and the
number of LPs in the application (one LP per node). If the number of LPs entered does
not match the number in application.h, the program exits with an error message.

After the simulation is completed, the simulation timing data will be in a series of
“log” files - one for each LP (e.g. log0, log1, log2, ...). These can be concatenated
together to provide a summary report for the simulation run. If OUTPUT was defined, each
LP’s output data will be in an lpx.out file (where x is the LP number). These files can be
concatenated and then sorted with the sgrep utility to provide a file that can be compared
with the sequential simulation output.
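
The log concatenation step can be sketched as follows. The function name concatenate_logs, the output name summary.log, and the separator line written before each LP's data are assumptions for illustration; nothing is assumed about the contents of the log files themselves.

```python
import re
from pathlib import Path


def concatenate_logs(directory, output_name="summary.log"):
    """Join log0, log1, ... in numeric LP order into one summary file."""
    directory = Path(directory)
    logs = []
    for path in directory.iterdir():
        match = re.fullmatch(r"log(\d+)", path.name)
        if match:
            logs.append((int(match.group(1)), path))
    logs.sort()  # numeric order, so log10 follows log9 rather than log1
    summary = directory / output_name
    with summary.open("w") as out:
        for lp, path in logs:
            out.write(f"==== LP {lp} ({path.name}) ====\n")
            out.write(path.read_text())
    return summary
```

Sorting on the numeric suffix matters once more than ten LPs are used, because a plain alphabetical listing would place log10 before log2.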


B.5  Step-by-Step Example.

This section illustrates the VSIM parallel simulation process with a step-by-step
example using the edge-triggered D flip-flop (et_dff) as the example circuit.

B.5.1  Develop VHDL Source Code. The specific rules and limitations for
developing the original VHDL source code are discussed in section B.2.1. Refer to (4) for
more detailed information. The VHDL source code files for the et_dff circuit are
archived on the iPSC/2 (cube386) in the directory ~/usr/simulate/vhdl/et_dff, and
all end in a .vhd extension.


B.5.2  Compile, Model Generate, and Build. In this example, it is assumed that the
user has already set up an Intermetrics work library as described in section B.2.2. In order
to compile, model generate, build, and simulate the circuit with Intermetrics VHDL, a
“.com” file similar to the following is required:

#!/bin/csh -v
# filename -> et_dff.com
vhdl ~/vhdl/aox_gates/nand_nor.vhd
vhdl et_dff.vhd
vhdl et_dff_test_bench.vhd
vhdl et_dff_config.vhd
mg '-debug-cknd nand_gate(simple)'
mg '-debug-cknd three_input_nand_gate(simple)'
mg '-debug-cknd et_dff(structural)'
mg '-debug-cknd et_dff_test_bench(structural)'
mg '-debug-cknd -top et_dff_config'
build '-debug-cknd -replace -ker-et_dff et_dff_config'
sim et_dff
rg et_dff et_dff.rcl
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In  this  example,  the  file  is  named  et_df  f .  com  and  is  executed  as  follows; 
~/vhdl/et_dff>  •t_def .con  >  •t_df£.out 

The activities of the signals specified in the “report control language” file
et_dff.rcl will be reported in the output file et_dff.rpt, and can be used to validate
the output of VSIM later.

The file et_dff.out will contain a record of the Intermetrics compilation, model
generate, and build sessions. This file will contain a list of all the header files (of the form
FN####) necessary to compile with VSIM, as well as the “Kernel com” file (the
Intermetrics compilation script). To extract this information, the following commands
may be used:

~/vhdl/et_dff> more et_dff.out | grep 'H file'

~/vhdl/et_dff> more et_dff.out | grep Kernel

The user should make note of the name of the “Kernel com” file for use with the
postprocessor pbuild. The header files should be copied to the current directory from the
user’s work library. For example, if the work directory is /usr/vhdl/shiplib/kkapp,
the following command can be executed for each header file listed in et_dff.out:

~/vhdl/et_dff> cp /usr/vhdl/shiplib/kkapp/FN#### .


B.5.3 Run Postprocessor to Transform Code. The “Kernel com” file is the
compilation script that the postprocessor uses to build the VSIM compatible code from
the Intermetrics code. The output report of the postprocessor is always written to a file
called plex.log in the same directory. The postprocessor pbuild is invoked as follows:

~/vhdl/et_dff> pbuild FN#### et_dff.c

Note that pbuild concatenates all of the relevant intermediate source code files into one
large source file (for this example, it is called “big_et_dff.c”), and then performs a
series of transformations on it using the lexical utility plex to produce the VSIM source
file (e.g. et_dff.c). During the transformation, some of the necessary
#include directives may be omitted from the transformed source file. They can be
extracted from the “big” source file by using the grep command, and then manually
inserted into the transformed source file. This usually only happens on large circuits (e.g.
the Wallace tree multiplier). Refer to (4) for more detailed information.
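The manual grep step described above could be automated along the following lines. This is only a sketch and is not part of the VSIM tool set: the function name missing_includes is hypothetical, and it assumes that the two source files have already been read into strings and that every directive of interest sits on a line of its own.

```python
import re

# Matches a C preprocessor include line such as  #include "vsim.h"  or  # include <stdio.h>
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*([<"])([^>"]+)[>"]')


def list_includes(source_text):
    """Return the normalized #include directives in source_text, in order, without repeats."""
    found = []
    for line in source_text.splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            opener, target = match.groups()
            closer = ">" if opener == "<" else '"'
            directive = f"#include {opener}{target}{closer}"
            if directive not in found:
                found.append(directive)
    return found


def missing_includes(big_source, transformed_source):
    """Directives present in the "big" file but absent from the transformed file."""
    present = set(list_includes(transformed_source))
    return [d for d in list_includes(big_source) if d not in present]


if __name__ == "__main__":
    big = '#include <stdio.h>\n#include "vsim.h"\nint x;\n'
    small = '#include <stdio.h>\nint x;\n'
    for directive in missing_includes(big, small):
        print(directive)
```

The directives reported by such a script would still have to be inserted into the transformed source file by hand, as described above.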


B.5.4 Run Sequential VSIM Simulation. The following files are needed on the
SPARCstation in order to run the sequential VSIM:

~/kkapp/vhdl/vsim/vinit.c          ~/kkapp/vhdl/spectrum/globals.h
                  vsim.c                                 application.h
                  vsim.h
                  vspec.c
                  vtools.c


To compile VSIM, a makefile is required. The example makefile below compiles
VSIM on the SPARC using the command “make vsim.”

# SPARC macros for sequential execution - type "make vsim"
SPARC_SIMPATH=/usr2/eng/kkapp/vhdl/vsim
SPARC_CKTPATH=/usr2/eng/kkapp/vhdl/et_dff
SPARC_SPECPATH=/usr2/eng/kkapp/vhdl/spectrum
SPARC_OBJS=${SPARC_SIMPATH}/vsim.o \
	${SPARC_SIMPATH}/vinit.o \
	${SPARC_SIMPATH}/vtools.o \
	${SPARC_CKTPATH}/et_dff.o
SPARC_CFLAGS= -c -w -g -DSPARC -DMAPPING -DOUTPUT
#
# Compiles VSIM for sequential operation on the Sun SparcStations.
#
vsim: ${SPARC_OBJS}
	$(CC) -o et_dff -g ${SPARC_OBJS}

${SPARC_SIMPATH}/vsim.o: ${SPARC_SIMPATH}/vsim.c \
		${SPARC_SIMPATH}/vsim.h
	cd ${SPARC_SIMPATH}; \
	$(CC) ${SPARC_CFLAGS} -I${SPARC_SPECPATH} vsim.c

${SPARC_SIMPATH}/vinit.o: ${SPARC_SIMPATH}/vinit.c \
		${SPARC_SIMPATH}/vsim.h
	cd ${SPARC_SIMPATH}; \
	$(CC) ${SPARC_CFLAGS} vinit.c

${SPARC_SIMPATH}/vtools.o: ${SPARC_SIMPATH}/vtools.c \
		${SPARC_SIMPATH}/vsim.h
	cd ${SPARC_SIMPATH}; \
	$(CC) ${SPARC_CFLAGS} vtools.c

${SPARC_CKTPATH}/et_dff.o: et_dff.c ${SPARC_SIMPATH}/vsim.h
	$(CC) ${SPARC_CFLAGS} -I${SPARC_SIMPATH} et_dff.c


Once the makefile has been completed, the following commands compile and run
the sequential version of VSIM:

~/vhdl/et_dff> make vsim
~/vhdl/et_dff> et_dff > temp

The output in temp will be in time order. However, the following command will do a
secondary sort by signal name and place the output in the file et_dff.out:

~/vhdl/et_dff> sgrep temp et_dff.out

The data in et_dff.out can now be compared to the Intermetrics output for
accuracy verification. However, the only way to do this is manually, since the two output
reports will be in different formats.


B.5.5 Extract Behavior Dependencies using VMAP. Since mapping was defined
during compilation, the output in temp also has behavioral information (behavior names,
id numbers, and dependencies). Using vmap, this information can be filtered out of temp
and saved. The vmap program attempts to “guess” the delays of each behavior, based on
when dependent behaviors are scheduled. The user is given a chance to override these
guesses. In most cases, the behaviors which represent gates show correct delays; the
other “system” behaviors should be set to a delay of zero. To run vmap, type the
following and respond to the prompts as appropriate (output will be written to
et_dff.vmap):

~/vhdl/et_dff> vmap temp et_dff.vmap

The mapping of behavior numbers to actual behaviors is done automatically by
VSIM. Currently, the only way to verify this mapping is to compare the output of either
VSIM or vmap to the schematic.


B.5.6 Generate the Circuit Partition for Parallel Execution. Using the output file
from the previous step (e.g., et_dff.vmap), use the VHDL Graph-Partitioning Tool
(GP-Tool) to generate the partition for the desired number of LPs. Refer to the GP-Tool
User’s Guide for specific instructions on generating the partition. The needed files from
this step are the lpx.arcs and the lpx.map files, where x is the number of LPs.


B.5.7 Compile and Execute the Parallel Simulation. Copy the necessary files
specified in section B.4.2 over to the hypercube. Before compiling, a makefile to
compile the appropriate files is required. The example makefile below compiles VSIM
on the iPSC/2 hypercube using the command “make ipsc.”


# iPSC macros for parallel execution on iPSC/2 - type "make ipsc"
# local paths
MY_SIMPATH=/usr2/eng/kkapp/vsim
MY_CKTPATH=/usr2/eng/kkapp/et_dff
MY_SPECPATH=/usr2/eng/kkapp/spectrum
# afit paths
AFIT_SIMPATH=/usr/simulate/vhdl/vsim
UVA_SPECPATH=/usr/simulate/spectrum/afit
AFIT_SPECPATH=/usr/simulate/spectrum/afit
AFIT_SPECPATH_INC=/usr/simulate/spectrum/afit/include
AFIT_FILTERPATH=/usr/simulate/spectrum/filters

SPECHEADERS=${MY_SPECPATH}/globals.h ${MY_SPECPATH}/application.h

MY_OBJECTS=${MY_SIMPATH}/vsim.o \
	${MY_SIMPATH}/vinit.o \
	${MY_SIMPATH}/vtools.o \
	${MY_SIMPATH}/vspec.o \
	${MY_SPECPATH}/lp_man.o \
	${MY_SPECPATH}/cube2.o \
	${MY_SPECPATH}/vhdlclocks.o \
	${MY_CKTPATH}/et_dff.o

MY_CFLAGS= -c -w -DMAPPING -UOUTPUT -DCOUNTS -DMONITORCUBE

ipsc: host node

host: ${MY_SPECPATH}/host2.o
	$(CC) -o host ${MY_SPECPATH}/host2.o -host

node: ${MY_OBJECTS}
	$(CC) -o et_dff ${MY_OBJECTS} -node

# compiles host2.c out of the archive directories
${MY_SPECPATH}/host2.o: ${AFIT_SPECPATH}/host2.c \
		${AFIT_SPECPATH_INC}/cube2.h
	cd ${MY_SPECPATH}; \
	$(CC) ${MY_CFLAGS} -I${AFIT_SPECPATH_INC} ${AFIT_SPECPATH}/host2.

# compiles vsim.c out of the archive directories
${MY_SIMPATH}/vsim.o: ${AFIT_SIMPATH}/vsim.c \
		${AFIT_SIMPATH}/vsim.h \
		${MY_SPECPATH}/globals.h \
		${MY_SPECPATH}/application.h
	cd ${MY_SIMPATH}; \
	$(CC) ${MY_CFLAGS} -I${MY_SPECPATH} ${AFIT_SIMPATH}/vsim.c

# compiles vinit.c out of the archive directories
${MY_SIMPATH}/vinit.o: ${AFIT_SIMPATH}/vinit.c \
		${AFIT_SIMPATH}/vsim.h \
		${MY_SPECPATH}/globals.h \
		${MY_SPECPATH}/application.h
	cd ${MY_SIMPATH}; \
	$(CC) ${MY_CFLAGS} -I${MY_SPECPATH} ${AFIT_SIMPATH}/vinit.c

# compiles vtools.c out of the archive directories
${MY_SIMPATH}/vtools.o: ${AFIT_SIMPATH}/vtools.c \
		${AFIT_SIMPATH}/vsim.h
	cd ${MY_SIMPATH}; \
	$(CC) ${MY_CFLAGS} ${AFIT_SIMPATH}/vtools.c

# compiles vspec.c out of the archive directories
${MY_SIMPATH}/vspec.o: ${AFIT_SIMPATH}/vspec.c \
		${AFIT_SIMPATH}/vsim.h \
		${MY_SPECPATH}/globals.h \
		${MY_SPECPATH}/application.h
	cd ${MY_SIMPATH}; \
	$(CC) ${MY_CFLAGS} -I${MY_SPECPATH} ${AFIT_SIMPATH}/vspec.c

# compiles lp_man.c out of the archive directories
${MY_SPECPATH}/lp_man.o: ${UVA_SPECPATH}/lp_man.c \
		${SPECHEADERS}
	cd ${MY_SPECPATH}; \
	$(CC) ${MY_CFLAGS} -I${MY_SPECPATH} ${UVA_SPECPATH}/lp_man.c

# compiles cube2.c out of the archive directories
${MY_SPECPATH}/cube2.o: ${AFIT_SPECPATH}/cube2.c \
		${SPECHEADERS} \
		${AFIT_SPECPATH_INC}/cube2.h
	cd ${MY_SPECPATH}; \
	$(CC) ${MY_CFLAGS} -I${AFIT_SPECPATH_INC} -I${MY_SPECPATH} \
		${AFIT_SPECPATH}/cube2.c

# compiles vhdlclocks.c out of the archive directories
${MY_SPECPATH}/vhdlclocks.o: ${AFIT_FILTERPATH}/vhdlclocks.c \
		${SPECHEADERS}
	cd ${MY_SPECPATH}; \
	$(CC) ${MY_CFLAGS} -I${MY_SPECPATH} ${AFIT_FILTERPATH}/vhdlclocks

# compiles et_dff.c out of local directory
${MY_CKTPATH}/et_dff.o: et_dff.c ${AFIT_SIMPATH}/vsim.h
	$(CC) ${MY_CFLAGS} -I${AFIT_SIMPATH} et_dff.c


Once  the  makefile  has  been  completed,  the  following  command  compiles  the 
parallel  version  of  VSIM: 

c386 #: setlps # (where # is the number of LPs desired).
c386 #: make ipsc

It should be noted that the exact command depends upon the makefile. In the VSIM
archives, a makefile is provided with each circuit that will compile on the SPARC with
the command “make vsim,” on the iPSC/2 with the command “make ipsc,” or on the
i860 with the command “make i860.” These makefiles can be used as templates for
future circuits.

The simulation is started by invoking the host program and typing the appropriate
information at the prompts, including entering the simulation end time in ns as a
command line argument. Below is an example simulation session for two LPs (user
entries in bold):

CUBE386: /usr2/eng/kkapp/et_dff > host
Which application do you want to use?:et_dff
Enter the command line arguments for the program (RETURN if none):
>2000
Is assignment of logical processes to nodes to be from a file? (y/n) -> n
The cube is being used as follows:
CUBENAME  USER  SRM      HOST     TYPE  TTYS
iocube    root  cube386  cube386  0
How many cube nodes do you want to use? (0 to ABORT):2
How many LP's are in this application?:2
Do you want to use the 'natural' node assignment? (y/n): y
Getting cube of size 2 - stand by.
load -H -p 0 0 et_dff 2000
load -H -p 0 1 et_dff 2000
startcube
Cube Loaded
LAST_TIME message from LP 0 on node 0, pid 0.
LAST_TIME message from LP 1 on node 1, pid 0.
End stats messages:
LP 0 (node 0, pid 0): 661 received, 672 sent.
Max message count set at 10, Max messages removed was 2.
LP 1 (node 1, pid 0): 672 received, 661 sent.
Max message count set at 10, Max messages removed was 1.
HOST: Total CPU time waiting: 0.000000 (msecs)
HOST: Wall clock time loading cube: 5 (secs)
HOST: Wall clock time waiting: 2 (secs)

Each LP will record its timing data in a “log” file (e.g. log0 for LP0). With COUNTS
defined during compilation, these files will also contain message traffic information. The
files can be concatenated and viewed as follows:


CUBE386: /usr2/eng/kkapp/et_dff > cat log0 log1 > time2.out
CUBE386: /usr2/eng/kkapp/et_dff > more time2.out

VSIM LP0 reports total time of 863
LP0 NULLs Sent = 652
LP0 NULLs Posted = 0
LP0 NULLs Processed/Deleted From Myself = 0
LP0 NULLs Processed/Deleted From Another LP = 653
LP0 NULLs Annihilated = 0
LP0 Reals Sent = 20
LP0 Reals Posted = 0
LP0 Reals Processed From Myself = 0
LP0 Reals Processed From Another LP = 8
LP 0 wall time taken is 2.107 (secs)
LP 0 messages received 661
LP 0 messages sent 672

VSIM LP1 reports total time of 851
LP1 NULLs Sent = 653
LP1 NULLs Posted = 0
LP1 NULLs Processed/Deleted From Myself = 0
LP1 NULLs Processed/Deleted From Another LP = 652
LP1 NULLs Annihilated = 0
LP1 Reals Sent = 8
LP1 Reals Posted = 0
LP1 Reals Processed From Itself = 0
LP1 Reals Processed From Another LP = 20
LP 1 wall time taken is 2.098 (secs)
LP 1 messages received 672
LP 1 messages sent 661


If OUTPUT was defined during compilation, the signal change information for each LP
will be in a file called lpx.out (where x is the LP number). These files can be
concatenated and sorted with sgrep for comparison with the sequential output
(et_dff.out).

Appendix C. Graph Partitioning Tool (GP-Tool)


C.1 GP-Tool User’s Guide

C.1.1 Overview. Prior to executing a parallel VHDL simulation using VSIM, it is
necessary to divide the simulation workload among the available processes. This
process, referred to as circuit partitioning, is accomplished by the VHDL
Graph-Partitioning Tool (GP-Tool). GP-Tool builds a behavior inter-dependency graph from a
circuit description file and provides several partitioning options for assigning the vertices
of the inter-dependency graph to the specified number of logical processes (LPs). This
section describes how to use GP-Tool to read in a circuit description file and generate the
partition output files required by VSIM’s parallel mode.

The current version of GP-Tool (version 2.0) is an extension to the VHDL Graph
Searching Program, henceforth referred to as the original version of GP-Tool. The original
version was written in 1992 by Maj Eric R. Christensen, USA, instructor at the Air Force
Institute of Technology, in order to provide a random mapping of the VHDL behaviors onto the
logical processes of the parallel simulation. The ability to perform a topological sort on
the nodes in the problem-graph was also provided (25). GP-Tool is implemented in the
Ada programming language using the Sun Ada compiler, version 1.1 (available on
aurora in the AFIT parallel simulation laboratory).


C.1.2 Building the Behavior Inter-Dependency Graph. The introductory screen to
GP-Tool is shown in Figure 64. It describes the required format of the circuit description
input file and prompts for the input filename (e.g. “et_dff.vmap” in Figure 64). The
circuit description file contains a list of circuit behaviors along with their names, logical


********************************************************************
Welcome to the VHDL Graph Partitioning Tool (GP-Tool) - Version 2.0
********************************************************************

This program reads a file with the following format:

- Integer, space, String(1..80 characters), Integer, newline
or 1..N integers followed immediately by a newline

- The 1..N integers are considered adjacencies to the first integer

The program then builds a graph of the adjacencies and dependencies


Enter the Name of the input data file:
et_dff.vmap

-- Reading Input File and Inserting Vertices in the Graph

-- Reading Input File and Inserting Arcs in the Graph

Figure 64. GP-Tool Introductory Screen


9 ET_DFF_TEST_BENCH(STRUCTURAL) 0 1 2
8 ET_DFF_TEST_BENCH(STRUCTURAL) 0 3
7 ET_DFF(STRUCTURAL) 0
6 ET_DFF(STRUCTURAL) 0
5 NAND_GATE(SIMPLE) 3000000 4 7
4 NAND_GATE(SIMPLE) 3000000 5 6
3 NAND_GATE(SIMPLE) 3000000 0 2
2 THREE_INPUT_NAND_GATE(SIMPLE) 3000000 3 5
1 NAND_GATE(SIMPLE) 3000000 0 2 4
0 NAND_GATE(SIMPLE) 3000000 1


Figure 65. GP-Tool Input File “et_dff.vmap”

delays, and list of dependencies. With this information, GP-Tool builds a directed graph
data structure, with each vertex corresponding to a circuit behavior, and each arc
representing a unique behavior-to-behavior dependency.
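The following sketch illustrates how such a graph could be read from a vmap file. It is written in Python for brevity and is not the Ada implementation used by GP-Tool; the function name parse_vmap and the dictionary-based representation are assumptions, and the sketch assumes that behavior names contain no embedded spaces, as is the case for the file in Figure 65.

```python
# The edge-triggered D flip-flop description from Figure 65.
ET_DFF_VMAP = """\
9 ET_DFF_TEST_BENCH(STRUCTURAL) 0 1 2
8 ET_DFF_TEST_BENCH(STRUCTURAL) 0 3
7 ET_DFF(STRUCTURAL) 0
6 ET_DFF(STRUCTURAL) 0
5 NAND_GATE(SIMPLE) 3000000 4 7
4 NAND_GATE(SIMPLE) 3000000 5 6
3 NAND_GATE(SIMPLE) 3000000 0 2
2 THREE_INPUT_NAND_GATE(SIMPLE) 3000000 3 5
1 NAND_GATE(SIMPLE) 3000000 0 2 4
0 NAND_GATE(SIMPLE) 3000000 1
"""


def parse_vmap(text):
    """Return (names, delays, adjacency) dictionaries keyed by behavior id.

    Each non-empty line holds: id, name, delay, then zero or more ids of
    the behaviors that depend on this one (the arcs leaving the vertex).
    """
    names, delays, adjacency = {}, {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        vertex = int(fields[0])
        names[vertex] = fields[1]
        delays[vertex] = int(fields[2])
        adjacency[vertex] = [int(f) for f in fields[3:]]
    return names, delays, adjacency


if __name__ == "__main__":
    names, delays, adjacency = parse_vmap(ET_DFF_VMAP)
    arcs = sum(len(targets) for targets in adjacency.values())
    print(len(adjacency), "vertices,", arcs, "arcs")
```

Like GP-Tool itself, this sketch performs no error checking on the input file.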

GP-Tool assumes that the input file contains a line for each behavior, and that each
behavior description conforms to the required format specification. The current version of
GP-Tool does not contain the error detection and recovery mechanisms necessary to
compensate for discrepancies in the input file. An incorrect input file may result in
erroneous partition output files, or may cause the program execution to be abandoned.

Correctly formatted input files can be produced by following the instructions in
sections B.3 and B.4 of the AFIT Parallel VHDL Simulation User’s Guide (Appendix B)
for using the vmap utility program. The output files created by vmap conform to the GP-Tool
input file specifications. An example vmap output file for the edge-triggered D flip-flop of
Figure 18 is shown in Figure 65.


C.1.3 Main Menu Options.

C.1.3.1 Generate Delay and Adjacency Information File. In the original
version of GP-Tool, the lpx.arcs files were not created directly. Rather, an intermediate


********************** GP-TOOL MAIN MENU ***********************

Select one of the following operations:

1 : Generate Delay and Adjacency Information File

2 : Generate SGE Data File

3 : Generate Topological Sort File

4 : Generate Strong Components File

5 : Generate Behavior to Logical Process (LP) Mapping File(s)

0 : Quit GP-Tool


Enter your menu choice now:


Figure 66. GP-Tool Main Menu


*SIZE
10
*
*SOURCE
8 0
9 0
*
*SERVER
0 3000000
1 3000000
2 3000000
3 3000000
4 3000000
5 3000000
*
*SINK
6 0
7 0
*
*NETWORK
0 1
1 0 2 4
2 3 5
3 0 2
4 5 6
5 4 7
6
7
8 3
9 1 2


Figure 67. Example Delay and Adjacency File for Edge-Triggered D Flip-Flop


file was created which, along with the lpx.map file, was used as input to a separate utility
application called build_arc which produced the required lpx.arcs file. The delay and
adjacency information file created by this menu option represents that intermediate
description file. An example for the edge-triggered D flip-flop is shown in Figure 67.
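The layout of Figure 67 can be reproduced with a short sketch. The grouping rule used here is inferred from the example rather than taken from the GP-Tool source: vertices with no incoming arcs are listed under SOURCE, vertices with no outgoing arcs under SINK, and all others under SERVER. The function names are hypothetical.

```python
# Adjacency lists and delays of the edge-triggered D flip-flop (Figure 65).
ADJ = {0: [1], 1: [0, 2, 4], 2: [3, 5], 3: [0, 2], 4: [5, 6],
       5: [4, 7], 6: [], 7: [], 8: [3], 9: [1, 2]}
DELAY = {0: 3000000, 1: 3000000, 2: 3000000, 3: 3000000, 4: 3000000,
         5: 3000000, 6: 0, 7: 0, 8: 0, 9: 0}


def classify(adj):
    """Split vertices into (sources, servers, sinks) by arc direction."""
    has_input = {w for targets in adj.values() for w in targets}
    sources = sorted(v for v in adj if v not in has_input)
    # Assumption: a vertex with neither inputs nor outputs counts as a source.
    sinks = sorted(v for v in adj if not adj[v] and v not in sources)
    servers = sorted(v for v in adj if v not in sources and v not in sinks)
    return sources, servers, sinks


def delay_adjacency_text(adj, delay):
    """Format the graph in the *SIZE / *SOURCE / *SERVER / *SINK / *NETWORK layout."""
    lines = ["*SIZE", str(len(adj)), "*"]
    for title, group in zip(("SOURCE", "SERVER", "SINK"), classify(adj)):
        lines.append("*" + title)
        lines.extend(f"{v} {delay[v]}" for v in group)
        lines.append("*")
    lines.append("*NETWORK")
    for v in sorted(adj):
        lines.append(" ".join(str(x) for x in [v] + adj[v]))
    return "\n".join(lines)


if __name__ == "__main__":
    print(delay_adjacency_text(ADJ, DELAY))
```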

However, the application build_arc is no longer supported and is unable to handle
large input files. As a result, the functionality to produce the lpx.arcs files was built into
the current version of GP-Tool, obviating the need for this output file. Nevertheless, the
option has been retained in the event that it is needed in the future.


C.1.3.2 Generate SGE Data File. The second main menu option allows the
user to create a graph description data file that can be read by the commercial Synopsys
design analyzer to produce a graphical representation of the input graph that can be
displayed using the Synopsys Graphic Editor (SGE). Again, this is a feature of the
original version of GP-Tool that was retained for possible future applications. The
process for setting up the Synopsys design analyzer and processing the SGE data file
produced by GP-Tool to attain the graphic representation is rather complex and is not


The Number of Arcs is 15
The Following Nodes have no Inputs
I8 I9
The Following Nodes have no Outputs
O6
O7
ADJ N0 A0
DEP N0 A1 A3
ADJ N1 A1 A1
DEP N1 A0 A9
ADJ N2 A2 A2
DEP N2 A1 A3
ADJ N3 A3 A3
DEP N3 A2 A8
ADJ N4 A4 A4
DEP N4 A1 A5
ADJ N5 A5 A5
DEP N5 A2 A4
ADJ N6 O6
DEP N6 A4
ADJ N7 O7
DEP N7 A5
ADJ N8 A8
DEP N8 I8
ADJ N9 A9 A9
DEP N9 I9

Figure 68. Example SGE Data File for Edge-Triggered D Flip-Flop

discussed here. An example SGE data file produced by GP-Tool for the edge-triggered D
flip-flop is shown in Figure 68.


C.1.3.3 Generate Topological Sort File. The third main menu option allows
the user to create a topological ordering of the nodes in the behavior inter-dependency
graph. This ordering specifies the order in which the circuit behaviors would have to be
executed if simulated sequentially. This is another feature of the original version of
GP-Tool that was retained for possible future applications. An example topological sort
output file for the edge-triggered D flip-flop is shown in Figure 69.
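One common way to obtain such an ordering is to reverse the finishing order of a depth-first search, sketched below. This is a generic technique and is not claimed to be the algorithm used by GP-Tool. A strict topological order exists only for acyclic graphs; the flip-flop graph contains feedback loops, so the position of nodes within a loop depends on the traversal details, and this sketch is not expected to reproduce Figure 69 exactly. The function name is hypothetical.

```python
def reverse_postorder(adj):
    """Return the vertices in reverse depth-first finishing order.

    For an acyclic graph, every arc u -> w has u before w in the result.
    The search is iterative so that large circuits do not exhaust the
    interpreter's recursion limit.
    """
    visited, finished = set(), []
    for root in sorted(adj):
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            vertex, successors = stack[-1]
            for w in successors:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                # All successors explored: the vertex is finished.
                stack.pop()
                finished.append(vertex)
    return finished[::-1]


if __name__ == "__main__":
    dag = {0: [1, 2], 1: [3], 2: [3], 3: []}
    print(reverse_postorder(dag))
```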


C.1.3.4 Generate Strong Components File. The fourth option on the main
menu is for performing a strong component search on the behavior inter-dependency
graph. An example output file for the edge-triggered D flip-flop is shown in Figure 70.
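A strong component search can be sketched with the standard two-pass (Kosaraju) method shown below. The source does not state which algorithm GP-Tool uses, so this is an illustration only, and the function names are hypothetical. Applied to the flip-flop graph of Figure 65, it finds six components of sizes 4, 2, 1, 1, 1, 1 and seven inter-component arcs, in agreement with Figure 70, although the order in which the components are listed may differ.

```python
ADJ = {0: [1], 1: [0, 2, 4], 2: [3, 5], 3: [0, 2], 4: [5, 6],
       5: [4, 7], 6: [], 7: [], 8: [3], 9: [1, 2]}


def dfs_finish_order(adj, roots, visited):
    """Iterative depth-first search; returns vertices in finishing order."""
    finished = []
    for root in roots:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(adj[root]))]
        while stack:
            vertex, successors = stack[-1]
            for w in successors:
                if w not in visited:
                    visited.add(w)
                    stack.append((w, iter(adj[w])))
                    break
            else:
                stack.pop()
                finished.append(vertex)
    return finished


def strong_components(adj):
    """Return the strongly connected components as lists of vertices."""
    # Pass 1: finishing order on the original graph.
    finish = dfs_finish_order(adj, sorted(adj), set())
    # Pass 2: search the reversed graph in decreasing finishing order.
    reverse = {v: [] for v in adj}
    for v, targets in adj.items():
        for w in targets:
            reverse[w].append(v)
    visited, components = set(), []
    for root in reversed(finish):
        if root not in visited:
            components.append(dfs_finish_order(reverse, [root], visited))
    return components


def inter_component_arcs(adj, components):
    """Count arcs whose endpoints lie in different components."""
    owner = {v: i for i, comp in enumerate(components) for v in comp}
    return sum(1 for v in adj for w in adj[v] if owner[v] != owner[w])


if __name__ == "__main__":
    comps = strong_components(ADJ)
    print(len(comps), "components;", inter_component_arcs(ADJ, comps), "inter-component arcs")
```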


N9 N8 N2 N1 N3 N7 N6 N5
N4 N0


Figure 69. Example Topological Sort File for Edge-Triggered D Flip-Flop


GRAPH INFORMATION - et_dff.vmap

The number of vertices in this graph is : 10
The number of arcs in this graph is : 15

PARTITION INFORMATION -- Strong Component Search

Number of components : 6
Inter-component arcs : 7

The Strong Component sizes are :
4 2 1 1 1 1

Component Number 0 - Size: 4 - Local Arcs: 6
2 3 0 1

Component Number 1 - Size: 2 - Local Arcs: 2
5 4

Component Number 2 - Size: 1 - Local Arcs: 0
6

Component Number 3 - Size: 1 - Local Arcs: 0
7

Component Number 4 - Size: 1 - Local Arcs: 0
8

Component Number 5 - Size: 1 - Local Arcs: 0
9


Figure 70. Example Strong Component File for Edge-Triggered D Flip-Flop

C.1.3.5 Generate Behavior to LP Mapping Files. The fifth option on the
GP-Tool main menu takes the user to a sub-menu with options for generating partition files
using one of several partitioning strategies implemented in GP-Tool. The GP-Tool
behavior mapping sub-menu is shown in Figure 71, and is discussed in further detail in
the next section.

C.1.4  Mapping  Menu  Options. 

C.1.4.1 Generate Partitioning Files. Options 1-6 on the GP-Tool behavior
mapping sub-menu allow the user to create the circuit partition files lpx.map and
lpx.arcs required to execute the simulation in parallel using VSIM. In addition, a


************ gp-tool behavior mapping menu *************

Select one of the following operations:

1 : Generate Random Partitioning File

2 : Generate Single Depth-First Partitioning File

3 : Generate Single Breadth-First Partitioning File

4 : Generate AB1-Annealing Partitioning File

5 : Generate AB2-Annealing Partitioning File

6 : Generate AB3-Annealing Partitioning File

7 : Turn the .MAP and .ARCS output OFF

8 : Modify the Cost Function Parameters

9 : Return to Main Menu
0 : Quit GP-Tool


Enter your menu choice now:


Figure 71. GP-Tool Behavior Mapping Sub-Menu


partition  statistics  file  is  created  such  as  the  one  shown  in  Figure  27.  The  following 
partitioning  options  are  available: 

• Random Partition - Use a random number function to randomly distribute
the behaviors among the specified number of LPs, ignoring the behavior
inter-dependency relationships. The user will be prompted to input a random stream
number between 1 and 100 that is used as an input to the random number
generator. This is the only partitioning option that was available in the original
version of GP-Tool.

• Simple Depth-First (SDF) Partition - Use a depth-first search algorithm
to traverse the behavior inter-dependency graph and determine the LP
assignments.

• Simple Breadth-First (SBF) Partition - Use a breadth-first search
algorithm to traverse the behavior inter-dependency graph and determine the LP
assignments.

• AB1-Annealing Partition - Use the AB border-annealing algorithm to
refine an initial SDF partition.

• AB2-Annealing Partition - Use the AB border-annealing algorithm to
refine an initial SBF partition.

• AB3-Annealing Partition - Use the AB border-annealing algorithm to
refine an initial random partition.
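The difference between a random partition and a traversal-based partition can be illustrated with the sketch below. It is not the GP-Tool implementation. In particular, the rule used here to turn a traversal order into LP assignments (cutting the order into contiguous blocks of nearly equal size) is an assumption, as are the function names and the use of the stream number as a random seed.

```python
import random
from collections import deque

ADJ = {0: [1], 1: [0, 2, 4], 2: [3, 5], 3: [0, 2], 4: [5, 6],
       5: [4, 7], 6: [], 7: [], 8: [3], 9: [1, 2]}


def bfs_order(adj):
    """Breadth-first visiting order, starting from vertices with no inputs."""
    has_input = {w for targets in adj.values() for w in targets}
    starts = sorted(v for v in adj if v not in has_input) + sorted(adj)
    seen, order = set(), []
    for start in starts:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
    return order


def block_assign(order, num_lps):
    """Assumed rule: cut the order into num_lps contiguous, balanced blocks."""
    base, extra = divmod(len(order), num_lps)
    lp_of, start = {}, 0
    for lp in range(num_lps):
        size = base + (1 if lp < extra else 0)
        for v in order[start:start + size]:
            lp_of[v] = lp
        start += size
    return lp_of


def random_assign(adj, num_lps, stream):
    """Balanced random assignment; stream plays the role of the stream number."""
    order = sorted(adj)
    random.Random(stream).shuffle(order)
    return block_assign(order, num_lps)


def inter_lp_arcs(adj, lp_of):
    """Number of arcs whose two behaviors are assigned to different LPs."""
    return sum(1 for v in adj for w in adj[v] if lp_of[v] != lp_of[w])


if __name__ == "__main__":
    sbf = block_assign(bfs_order(ADJ), 2)
    rnd = random_assign(ADJ, 2, stream=1)
    print("breadth-first:", inter_lp_arcs(ADJ, sbf), "inter-LP arcs")
    print("random:       ", inter_lp_arcs(ADJ, rnd), "inter-LP arcs")
```

Because a traversal keeps dependent behaviors close together in the visiting order, a block assignment of that order tends to cut fewer arcs than a random assignment, which is the motivation for using it as the starting point for annealing.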

Each of the AB Annealing partition options requires a set of input parameters to
control the border annealing process. The parameters that are specific to the AB
Annealing algorithm are presented to the user in a sub-menu after the user has specified
the number of LPs and entered the appropriate file names.

Figure 72. GP-Tool AB Annealing Parameters Sub-Menu


The AB Annealing parameters sub-menu is shown in Figure 72. If the default
parameter values are satisfactory, the user can enter ‘0’ to begin the partitioning process.
Otherwise, the parameter values can be changed by entering the appropriate line number
and entering the new value (if appropriate). The specific parameters are as follows:

•  HuaftMie  of  itoratlons  -  Defines  the  maximum  number  of  annealing 
iterations  to  perform  befme  terminating  the  process,  with  a  maximum  of  IfXX). 
Realistically,  the  default  value  of  S(X)  provides  more  than  enough  iterations  to 
converge  to  a  solution  with  the  circuits  used  as  test  cases  in  this  diesis. 

•  Max  Nuabax  of  worthiass  xtoratlons  -  Defines  the  maximum  number  of 
consecutive  iterations  with  no  net  in^ovement  in  the  communications  cost 
portion  of  the  objective  cost  function  wlwh  can  be  processed  before  the  annealing 
process  is  terminated.  The  counter  which  tracks  worthless  iteratitms  is  reset  to 
zero  each  time  there  is  an  con^lete  iteration  with  a  net  improvement  in  the  subject 
cost  function.  This  value  should  be  large  enough  to  ensure  that  the  series  of 
worthless  iterations  indicates  an  actual  solution  convergence  and  not  jusi  a 
temporary  anomaly  in  the  annealing  process. 

•  Load  lobaianca  Toiaranca  -  Defines  the  maximum  value  of  the  load  delta 
factor  Hb  that  is  acceptable.  Moves  which  cause  Hb  to  be  larger  than  the  value  of 
this  parameter  will  not  be  made,  even  if  they  would  result  in  a  reduction  in  the 
communications  cost  sub-function.  A  value  of  0.0%  for  this  parameter  will 
automatically  be  defaulted  to  the  value  of  Hb  in  the  initial  partition.  A  load 
imbalance  of  one  behavior  can  result  in  the  initial  partition  algorithm  if  the 
number  of  behaviors  is  not  evenly  divisible  by  the  number  of  LPs.  Howcvct,  if  the 
number  of  behaviors  is  divisible  by  the  number  of  LPs,  the  initial  partition  will 
have  an  Hb  of  0.0%  and  the  a  0.0%  value  for  this  parameter  will  prevent  any 
moves  from  occurring,  rendering  the  annealing  process  useless. 
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•  ignox«__CoiflnjDist_ractox  -  Boolean  value  that  allows  the  factor  to  be 
ignored  when  computing  the  value  of  the  communications  sub-function  during  the 
annealing  process.  When  false,  it  is  possible  to  experience  a  slight  increase  in  the 
number  of  LP_output_Line3  as  the  value  of  Hd  is  reduced.  However,  when  true, 
the  annealing  algorithm  will  have  a  tendency  to  prevent  such  an  increase  to  the 
number  of  LP  Output  Lines  (which  has  a  direct  impact  on  the  number  of  null 
messages  sent),  as  well  as  lead  to  a  larger  reduction  in  the  number  of  inter-LP 
arcs.  Furthermore,  if  the  number  of  LP  output  Lines  was  reduced  when  this 
parameter  was  false,  setting  it  to  true  may  lead  to  a  larger  reduction. 

•  Includa  Hop  Woights  in  Priorities  -  When  calculating  the 
communications  costs  of  a  partition,  each  inter-LP  arc  is  multiplied  by  a  hop- 
weight  corresponding  to  the  number  of  hops  in  the  corresponding  physical 
communications  link.  When  set  to  true,  this  parameter  will  take  this  weighting 
into  account  when  prioritizing  and  queueing  behaviors  during  the  annealing 
process.  It  should  be  noted  that  the  default  hop  weights  are  all  1.0  (evenly 
weighted),  thus  rendering  this  option  meaningless.  The  hop  weights  can  be 
modified  using  option  eight  on  the  behavior  mapping  sub-menu.  If  uneven  hop 
weights  are  used,  setting  this  parameter  to  true  has  a  tendency  to  deteriorate  the 
performance  of  the  annealing  algorithm,  with  no  noticeable  improvement  in  the 
solution  quality. 

•  Log  Anne  aing  Data  -  Boolean  value  that  controls  the  printing  of  the 
partition  statistics  values  to  an  output  file  for  the  initial  partition  and  after  each 
annealing  iteration.  This  allows  the  progress  of  the  annealing  algorithm  to  be 
examined. 

•  Include Debug Info in Log - Boolean value that will cause information to
be added to the annealing log file for each behavior that is removed from the
annealing queue. This is for development/debugging purposes only. When true,
the number of iterations should be reduced to the 1-5 range, or the annealing
log will become too large to be of practical value. This parameter has no
effect if the previous parameter is set to false.

•  Annealing  Log  Filename  -  Defines  the  name  of  the  annealing  log  output 
file. 
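
As a rough illustration of the hop weighting described under Include Hop Weights in Priorities, the following Python sketch sums the hop-weighted inter-LP arcs of a small partition. It assumes that each LP number is also its hypercube node address, so that the hop count between two LPs is the Hamming distance of their addresses. The function names, the sample graph, and the uneven weight values are hypothetical and are not taken from the GP-Tool source, which is written in Ada.

```python
def hop_count(node_a: int, node_b: int) -> int:
    """Hops between two hypercube nodes: the Hamming distance of their addresses."""
    return bin(node_a ^ node_b).count("1")


def weighted_inter_lp_arcs(arcs, lp_of, hop_weights):
    """Sum the hop weight of every arc whose endpoints lie in different LPs."""
    total = 0.0
    for source, target in arcs:
        hops = hop_count(lp_of[source], lp_of[target])
        if hops > 0:  # arcs inside a single LP contribute nothing here
            total += hop_weights[hops]
    return total


# Hypothetical four-behavior circuit mapped onto LPs 0, 1, and 3.
arcs = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
lp_of = {"g1": 0, "g2": 0, "g3": 3, "g4": 1}

even_weights = {hops: 1.0 for hops in range(1, 8)}            # the default setting
uneven_weights = {hops: float(hops) for hops in range(1, 8)}  # assumed values

even_cost = weighted_inter_lp_arcs(arcs, lp_of, even_weights)
uneven_cost = weighted_inter_lp_arcs(arcs, lp_of, uneven_weights)
print(even_cost, uneven_cost)
```

With even weights the result is simply the count of inter-LP arcs, which is why the parameter has no effect under the default settings.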


C.1.4.2  Toggle .MAP and .ARCS Output.  Option seven on the GP-Tool
behavior mapping sub-menu allows the user to toggle the creation of the ipx.map and
ipx.arcs files on and off. This allows the user to create only the partition statistics files
for comparison purposes without incurring the overhead of entering filenames and
creating the ipx.map and ipx.arcs files. It is included primarily as an aid to the
development and testing process. Note that when the ipx.arcs file is not
produced, the value of Lares is set to 1.0 in the calculation of the predicted simulation
speedup in the partition statistics file because the actual lookahead values are not
available.


C.1.4.3  Modify Cost Function Parameters.  Option eight on the GP-Tool
behavior mapping sub-menu allows the user to set several miscellaneous parameters that
are not specific to the AB-Annealing algorithm. The modify parameters sub-menu is
shown in Figure 73.

*************  MODIFY  PARAMETERS  MENU  *************

The current parameter values are:

 1  Consider Topological Variation - false
 2  Ignore Zero Delays in .arcs file - true
 3  Alpha - 100.00
 4  Beta - 1.00
 5  Gamma - 0.0750000000
 6  One_Hop_Weight - 1.0
 7  Two_Hop_Weight - 1.0
 8  Three_Hop_Weight - 1.0
 9  Four_Hop_Weight - 1.0
10  Five_Hop_Weight - 1.0
11  Six_Hop_Weight - 1.0
12  Seven_Hop_Weight - 1.0

Enter the line number of the parameter to update, or zero (0) to exit:

Figure 73.  GP-Tool Cost Function Parameters Sub-Menu

The specific parameters include the following:

•  Consider Topological Variation - Boolean value which controls whether
or not the topological layout of the hypercube is considered when building an SDF
or SBF partition. Reference section 3.4.4 for more information.


•  Ignore Zero Delays in .arcs File - Boolean value which controls how
zero-delay behaviors are handled during the calculation of the LP delay values for
the ipx.arcs file. If true, a source behavior with a logical delay of zero in LP A
with an external arc to LP B will not cause the lookahead value for the LP output
line from A to B to be set to zero. Rather, the smallest non-zero value calculated
for that output line will be used. This works because in VSIM all
source behaviors must have their signal changes explicitly defined in the
testbench, thus causing them to be placed in the active list at simulation startup.


•  Alpha, Beta, and Gamma - Coefficients of the partition cost function that are
used in the calculation of the predicted speedup. These values have no effect upon
the partitioning algorithms.


•  Hop  Weights  -  Weights  associated  with  each  inter-LP  arc  based  upon  the 
number  of  hops  in  the  corresponding  physical  communications  link.  These  hop 
weights  will  not  affect  the  random,  SDF,  or  SBF  partitions  other  than  increasing 
the  value  of  the  communications  cost  function.  However,  since  the 
communications  cost  function  is  actively  used  in  the  AB  border  annealing 
process,  the  hop  weights  will  directly  affect  the  resulting  AB  annealing  partition. 
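
The zero-delay handling described under Ignore Zero Delays in .arcs File can be sketched in Python as follows. This is a minimal illustration rather than the GP-Tool implementation: the function name, the representation of an output line as a list of (delay, is_source_behavior) pairs, and the fallback to zero when no usable delay remains are all assumptions made for the example.

```python
def output_line_lookahead(arc_delays, ignore_zero_delays=True):
    """Return the lookahead for one LP output line.

    arc_delays: list of (delay, is_source_behavior) pairs, one per external arc
    on the output line from LP A to LP B.
    """
    if ignore_zero_delays:
        # Skip zero delays contributed by source behaviors.
        usable = [d for d, is_source in arc_delays if not (is_source and d == 0)]
    else:
        usable = [d for d, _ in arc_delays]
    # Assumption: with nothing usable, fall back to a lookahead of zero.
    return min(usable) if usable else 0


# Hypothetical output line with one zero-delay source behavior.
line_a_to_b = [(0, True), (5, False), (3, False)]
with_option = output_line_lookahead(line_a_to_b, ignore_zero_delays=True)
without_option = output_line_lookahead(line_a_to_b, ignore_zero_delays=False)
print(with_option, without_option)
```

With the option enabled, the line keeps a non-zero lookahead of 3; with it disabled, the zero-delay source behavior forces the lookahead to zero.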
C.2  GP-Tool Developer's Guide


This section provides a brief description of the Ada source files required to compile
and run the current version of the GP-Tool utility. The source code is highly modularized,
with different functions performed by separate Ada packages. The following is a list of
the primary Ada source files in the GP-Tool hierarchy:


ab_annealing_pkg_b.a
ab_annealing_pkg_s.a
annealing_tools_pkg_b.a
annealing_tools_pkg_s.a
build_graph_b.a
build_graph_s.a
graph_tool.a
misc_vhdl_pkgs_s.a
print_graph_b.a
print_graph_s.a
printing_pkg_b.a
printing_pkg_s.a
rand_gen_b.a
rand_gen_s.a
random_partition_b.a
random_partition_s.a
sbf_partition_pkg_b.a
sbf_partition_pkg_s.a
sc_search_pkg_b.a
sc_search_pkg_s.a
sdf_partition_pkg_b.a
sdf_partition_pkg_s.a
sort_vhdl_p.a
statistics_pkg_b.a
statistics_pkg_s.a
tools_pkg_b.a
tools_pkg_s.a
vhdl_top_sort_p.a


The  remaining  files  required  to  compile  GP-Tool  fall  into  the  category  of  generic  Ada 
packages,  and  are  as  follows: 


digraph_utilities_b.a          list_search_b.a
digraph_utilities_s.a          list_search_s.a
directed_graph_b.a             list_utilities_b.a
directed_graph_s.a             list_utilities_s.a
gen_doubly_linked_list_b.a     map_unbounded_cache_b.a
gen_doubly_linked_list_s.a     map_unbounded_cache_s.a
gen_static_strings_b.a         priority_queue_b.a
gen_static_strings_s.a         priority_queue_s.a
generic_queue_b.a              seq_storage_mngr_b.a
generic_queue_s.a              seq_storage_mngr_s.a
generic_top_sort_b.a           set_iterator_b.a
generic_top_sort_s.a           set_iterator_s.a
lim_private_map_pkg_b.a        stack_pkg_b.a
lim_private_map_pkg_s.a        stack_pkg_s.a

Figure 74 shows how the various Ada packages interact to form the complete
GP-Tool application. Note that misc_vhdl_pkgs_s.a contains several package
instantiations in a single file. Note also that the figure does not show the individual
generic package dependencies, but treats all of the generic packages as a single group
in order to simplify the diagram.

The functionality of the most critical packages is described below. Each of these
packages also contains extensive source file comments with more detailed information.

•  graph_tool - provides the overall program flow control, displaying the
menus and making the appropriate sub-routine calls based upon the menu option
selected by the user.

•  build_graph - reads the input file and builds the behavior inter-dependency
graph structure in memory.

•  random_partition_pkg - performs a random mapping of the behaviors to the
given number of LPs.

Figure 74.  GP-Tool Ada Package Dependency Graph

•  sdf_partition_pkg - performs a simple depth-first (SDF) mapping of the
behaviors to the given number of LPs.

•  sbf_partition_pkg - performs a simple breadth-first (SBF) mapping of the
behaviors to the given number of LPs.

•  ab_annealing_partition_pkg - takes a partitioned graph as input and
performs the AB border annealing algorithm in an effort to improve the quality of
the partition.

•  annealing_tools_pkg - provides several utility functions used by the AB
border annealing algorithm, such as displaying the annealing parameters
sub-menu, initializing the vertex priorities, and printing relevant data and debugging
information to the annealing log.

•  tools_pkg - consolidates several utility functions used by all of the
partitioning algorithms into a single file in order to minimize code duplication and
improve code maintainability; functions include initializing data structures and
linking new vertices to the Parent-Child chains representing a given LP
assignment.

•  statistics_pkg - provides routines needed to evaluate a partitioned graph
and calculate the statistics values associated with the partition (inter-component
arcs, load imbalance, predicted speedup, etc.); includes a routine to build and
initialize the communications weight matrix.

•  printing_pkg - provides routines to print the key output files, including the
partition statistics file, the ipx.map file, and the ipx.arcs file.

•  sc_search_pkg - provides routines to perform a strong component search on
an unpartitioned input file, linking the strong components together in a manner
similar to the LP assignment lists.

•  print_graph - provides a routine to print the “queue.dat” file used as input to
the build_arc utility, which can produce the corresponding ipx.arcs file; note
that build_arc is no longer supported and cannot handle large input graphs.

•  sort_VHDL - provides routines to perform a topological sort on an input
graph.
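
For readers unfamiliar with the topological sort mentioned for sort_VHDL, the following Python sketch shows one standard way to order a behavior dependency graph, using Kahn's algorithm. It is offered only as an illustration: the actual Ada packages may use a different algorithm, and the function name and the small sample graph are hypothetical.

```python
from collections import deque


def topological_sort(successors):
    """Order vertices so that every arc runs from an earlier to a later vertex.

    successors: dict mapping each vertex to the list of vertices it drives.
    """
    indegree = {vertex: 0 for vertex in successors}
    for targets in successors.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1

    # Start with the vertices that have no incoming arcs.
    ready = deque(sorted(v for v, degree in indegree.items() if degree == 0))
    order = []
    while ready:
        vertex = ready.popleft()
        order.append(vertex)
        for target in successors.get(vertex, ()):
            indegree[target] -= 1
            if indegree[target] == 0:
                ready.append(target)

    if len(order) != len(indegree):
        raise ValueError("graph contains a cycle; no topological order exists")
    return order


# Hypothetical circuit: two inputs feeding an AND gate and an OR gate.
circuit = {"in1": ["and1"], "in2": ["and1", "or1"], "and1": ["or1"], "or1": []}
order = topological_sort(circuit)
print(order)
```

A graph containing a feedback loop has no topological order, so the sketch raises an error in that case.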
Appendix D.  Simulation Performance Data


This  appendix  contains  additional  simulation  performance  data  to  supplement  the  data 
appearing  in  Chapter  5.  The  first  section  contains  tables  which  summarize  the  actual 
performance  data  for  a  selected  subset  of  test  cases.  The  second  section  contains 
additional  message  traffic  analysis  graphs  for  a  selected  subset  of  test  cases. 

Table 3 summarizes the performance of the single LP Wallace tree multiplier case.
Tables 4 to 16 summarize the performance of the Wallace tree multiplier for all four
partitioning types: random, SDF, SBF, and AB border annealing. Tables are included for
2, 4, and 8 LPs for each partition type. Each table contains the simulation time, the total
time (which includes the time to load the cube nodes), real messages transmitted, null
messages transmitted, and total messages transmitted. All performance measurements are
with respect to the simulation time. All times are in ns. Each table also summarizes the
corresponding partition statistics values as calculated by GP-Tool. Tables 17 to 28
provide an identical set of data for the associative memory array.

Figures 75 to 78 provide additional real messages vs. null messages transmitted
graphs to supplement the test cases discussed in Chapter 5. Graphs are provided for both
the Wallace tree multiplier and the associative memory array. Figures 79 to 84 provide the
corresponding total messages transmitted vs. output arcs graphs, while Figures 85 to 90
provide the total messages transmitted vs. LP output lines graphs.

Table 4.  Wallace Tree 2 LP Random Partition Simulation Results


Table 5.  Wallace Tree 4 LP Random Partition Simulation Results


Table 6.  Wallace Tree 8 LP Random Partition Simulation Results

Table 9.  Wallace Tree 8 LP SDF Partition Simulation Results

Table 10.  Wallace Tree 2 LP SBF Partition Simulation Results


Table 12.  Wallace Tree 8 LP SBF Partition Simulation Results

Table 13.  Wallace Tree 2 LP AB2 Partition Simulation Results


Table 15.  Wallace Tree 8 LP AB2 Partition Simulation Results

Table 16.  Associative Memory 1 LP Simulation Results
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• 

SMdn.CoopJmd 

-  OjO 

Max^-O’^-toad 

•  OjO 

Load-Dsitt.  Fiettir 

•  OM 

ftidkad  Snman 

•  OjOO 

-A^ 

4J»Il*T4 

4.SS2J27 

• 

• 

• 

Ev!T3 

_ 5L£2 _ 

S1J63 

0 

0 

0 
Table 17.  Associative Memory 2 LP Random Partition Simulation Results


I. 7S12M 
i^aoLSSi 
ijni.6t6 

ijnui2 

J. StJ.7» 

i.ntjm 

1,77<2(» 

i.»ia,tos 

l.St2,7SS 


1>I03» 

337JKS 

13.3M 

3S1^ 

isnjm 

S7J2S 

13.403 

352421 

imM 

337  J25 

13.357 

351.1t2 

337.125 

13.490 

35U15 

is9\jm 

vnjm 

13.432 

351457 

imsi6 

mjns 

13.473 

35149$ 

i^ysjKt 

337  J25 

13434 

3S1.349 

U323M 

337  J2S 

13.390 

351415 

iOSIJSS 

337425 

13.530 

351455 

2»04Ql916 

337425 

13^452 

351477 

Table 18.  Associative Memory 4 LP Random Partition Simulation Results


W(bt_lniBrJJ*_Afct 

•  6461-0 

AV|_Wtlt_AR> 

-  1.740.3 

SMdsv.WflULOuLAro 

•  3174 

MaidBv.WftaLOat.Arci 

>  4724 

StSaBv_W||il_ta..Axci 

-  91j0 

l4«ldBV_W^lt.|Q..AlCt 

-  30J 

CoiBav_Goft_ftLCtai 

-  74,75% 

CaiDBi_INii_Pacttv 

•  27.17% 

LP.Oatpitt_UQBf 

•  12 

l^ikahBid  Factor 

•  1400 

Avf_Coap_toad 

•  1,0604 

Stddev_CaaipJ«oid 

-  OJ 

MaxdBv.Goii9_Load 

-  04 

Lat4. Della  factor 

>  0X0% 

Predlded  SoeodoD  •  3.11  1 

1.3M.947 

1.496.105 

U3t.9M 

1.4n.lSg 

US2^ 

1408.436 

1^03.714 

].439jni 

1451.495 

1.3«<^t2 

1423.007 

U31430 

1.487.421 

14(n.6S4 

1.463475 

1J4S.794 

140«471 

1.3«7.435 

1423.614 

Table  19.  Associative  Memory  8  LP  Random  Partition  Simulation  Results 


W|f  Inter  iJ  Awta 

Avg_Wf)l_Arcs 

Stddev_W|ht.ODt_Arcs 

Maxdev_Wgbt_Out_Ara 

Stddev_W^t_ln_Arci 

MaxdBv.Wflit_lQ_Arcs 

Coam.Cott.Pftctar 

Coani_Dist_PKtor 

LP.OfUput.Lioet 

LookatettL  Factor 


1499.473 

1453499 

1.437416 

1491.823 

l,4S3.Wl 

1.607.432 

1.460423 

1.615489 

1407419 

1.662.466 

1,428424 

1482431 

1.468422 

1.623449 

1.447.797 

1,602430 

1.455.769 

1.610497 

1.415.692 

1470438 
Table 20.  Associative Memory 2 LP SDF Partition Simulation Results


Avg^Coaip_Lotd 

-  2.121J  1 

SUOrr.CoDip^Laad 

.  0.7 

|Maxdr^_Ooi^_Lotd 

-  OJ 

L<ad_Otttt_Fictar 

•  0j02% 

1.74M22 

1.709JS7 

\jm,m 
1.7K009 
1.747JI51 
1.739^2 
1.70M52 
1.729 J74 

1.49^449 


lJ73.11t 

1JS5317 

IJTI^ 

1.911.490 

1309.5a 

136M99 

1395.441 

1337391 

1396^017 


IBEJIS 


Table 21.  Associative Memory 4 LP SDF Partition Simulation Results


Table 22.  Associative Memory 8 LP SDF Partition Simulation Results
Table 23.  Associative Memory 2 LP SBF Partition Simulation Results

Circuit - Assoc Mem
Partition - SBF

Run     Sim Time    Total Time   Reals     Nulls    Total
 1      2439412     2.70IJI*     264.131   7.038    271.169
 2      241 UtO     2772921      264431    7.070    271401
 3      2AS*.***    2820.698     264431    7,097    271.168
 4      2^i*jat     Z.72I.2M     264.131   7,096    271.187
 5      2.5SWS7     2749418      264431    7.019    271.176
 6      2422437     2412409      264.132   7.094    271489
 7      2433.263    2782409      264.131   7.061    271.199
 8      2612499     wx»*i        264431    7.066    271,187
 9                  2742464      264.131   7.081    271412
10      2jnM*       2733469      264.131   7.069    271400
Avg                              264431    7.Sff7   271488
Stddev              39.680       0         14       14

Wght_Inter_LP_Arcs - 3.707i)
Avg_Wght_Arcs - 14934
Stddev_Wght_Out_Arcs - 6729
Maxdev_Wght_Out_Arcs - 4794
Stddev_Wght_In_Arcs - 6729
Maxdev_Wght_In_Arcs - 4754
Comm_Cost_Factor - 3941«
Comm_Dist_Factor - 2949%
LP_Output_Lines - 2
Lookahead_Factor - 0j667
Avg_Comp_Load - 2.121J
Stddev_Comp_Load - 0.7
Maxdev_Comp_Load - 04
Load_Delta_Factor - 0.02ft
Predicted Speedup - 1.96

Table 24.  Associative Memory 4 LP SBF Partition Simulation Results


1J02^9 

1.761.017 

1.900^192 

1.92M23 

1.763>I2 

1,761.027 

IjMfUU 

152M01 

1.763J37 


1.130,694 

1,999.591 

1.915.298 

2.094.494 

2.078.081 

1,917.910 

1.915.316 

2.094411 

2.078.0S7 

1417.487 


1BKID5JH 


Table 25.  Associative Memory 8 LP SBF Partition Simulation Results

Circuit - Assoc Mem
Partition - SBF

Run     Sim Time    Total Time   Reals     Nulls     Total
 1      2464.669    2.619        901.074   294,439   795409
 2      2099.664    22S0.030     901.072   295,665   796.737
 3      2398,870    2.S52.S31    901.070   295.631   796.701
 4      2384.139    2.J38J#8     501.070   296413    797483
 5      2428459     2383.033     501,070   295.721   796.791
 6      2392312     2546495      501,072   299.645   796,717
 7      2433.466    2487.701     501.070   296457    797427
 8      2484449     2546.995     901,070   296447    797.417
 9      2384,786    2951.485     501,070   295.783   796453
10      2436.649    2430,129     901.072   296424    797496
Avg     2430375     258M99       561,871   295482    796473
Stddev  108499      106.060      1         538       537

Wght_Inter_LP_Arcs - 7457.0
Avg_Wght_Arcs - 907.1
Stddev_Wght_Out_Arcs - 878.9
Maxdev_Wght_Out_Arcs - 2108.9
Stddev_Wght_In_Arcs - 400J
Maxdev_Wght_In_Arcs - 4794
Comm_Cost_Factor - 7743%
Comm_Dist_Factor - 23248%
LP_Output_Lines - 39
Lookahead_Factor - 0442
Avg_Comp_Load - 5304
Stddev_Comp_Load - 04
Maxdev_Comp_Load - 0.6
Load_Delta_Factor - 0.12%
Predicted Speedup - 271
Table 26.  Associative Memory 2 LP AB1 Partition Simulation Results


W(li..kil>rJ.P>ia 

-  IJHSi 

Avs_Wtt>_Arci 

-  637X) 

Svtdev.Wjiai-Otf.Arcs 

•  MZO 

Maulrf  _  W||)|_OKt.  Arcf 

•  2S6ja 

Siddiv_W||U_ln_>rci 

•  3«2i} 

Max<kv_WghtJn_Afcs 

-  2S&0 

Coam_CoKj^cttr 

•  i3.ei« 

Coam.OittJPacttv 

•  40.19% 

LPjOmpiitJjDM 

-  2 

1  onfcihBtt  Pictpr 

•  0.667 

Av^CcNis>^UMd 

-  2.121.S 

S(ddBv_Csii9_L<ad 

-  14S 

M«xdiv_Cioiiq)_Lottd 

•  lOJ 

Lotd  Delta  Factor 

-  0.49% 

1  1  1  ii— I  — 

Table 27.  Associative Memory 4 LP AB1 Partition Simulation Results


Avf_Cofiip_Loed 

•  1.060S 

StdaBv.Cacq>_Laad 

•  lOJ  ' 

Mtxdev.Coap.Load 

-  5J 

Load  Delta  Factor 

•  049%  ' 

1.103,822 

1,260.109 

1.00^145 

1.162,880 

9ti9.6M 

1.126.411 

991,067 

1.147,930 

899,179 

1.053,936 

896,319 

1.033360 

1.086^93 

1342.699 

1,016090 

1.173^ 

943,176 

1.100.039 

1,041,043 

i 

1.197378 

Table 28.  Associative Memory 8 LP AB1 Partition Simulation Results

Circuit - Assoc Mem
Partition - AB1

Run     Sim Time    Total Time   Reals    Nulls     Total
 1      1,186.793   1342304      82348    232,155   314,703
 2      1.169.694   1326.063     82348    232,437   314,983
 3      1.111.823   1368388      82348    232.051   314399
 4      1,183.344   1342327      82348    232,693   313,241
 5      1.216263    1375378      82348    232332    314,880
 6      I.I36t923   1.293330     82348    233734    315.282
 7      1.120,123   1376.639     82348    233716    313.264
 8      l.r44496    1381.187     82348    233627    315.175
 9      1,184,460   1341.099     82348    233007    3U3SS
10      1,149380    1303.815     82348    233309    314.857
Avg                              82348    233406
Stddev  33,199      33.674       0

Wght_Inter_LP_Arcs - 2394.0
Avg_Wght_Arcs - 324J
Stddev_Wght_Out_Arcs - 364.1
Maxdev_Wght_Out_Arcs - 583S
Stddev_Wght_In_Arcs - 383
Maxdev_Wght_In_Arcs - 703
Comm_Cost_Factor - 2736%
Comm_Dist_Factor - 180.65%
LP_Output_Lines - 40
Lookahead_Factor - 0302
Stddev_Comp_Load - 74
Maxdev_Comp_Load - 23
Load_Delta_Factor - 049%
Predicted Speedup - 3.40

Random Partition    Simple Depth-First Partition

Figure 75.  Wallace Tree 6 LP Reals Sent vs. Nulls Sent Message Analysis


Random Partition    Simple Depth-First Partition

Figure 76.  Wallace Tree 7 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 77.  Associative Memory 4 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 78.  Associative Memory 7 LP Reals Sent vs. Nulls Sent Message Analysis

Figure 80.  Wallace Tree 6 LP Total Messages Sent vs. Output Arcs

Figure 81.  Wallace Tree 7 LP Total Messages Sent vs. Output Arcs


Random Partition    Simple Depth-First Partition

Figure 82.  Associative Memory 4 LP Total Messages Sent vs. Output Arcs


Random Partition    Simple Depth-First Partition

Figure 83.  Associative Memory 7 LP Total Messages Sent vs. Output Arcs

Figure 84.  Associative Memory 8 LP Total Messages Sent vs. Output Arcs


Random Partition    Simple Depth-First Partition

Figure 85.  Wallace Tree 4 LP Total Messages Sent vs. LP Output Lines


Random Partition    Simple Depth-First Partition


Figure 86.  Wallace Tree 6 LP Total Messages Sent vs. LP Output Lines

Figure 87.  Wallace Tree 7 LP Total Messages Sent vs. LP Output Lines


Random Partition    Simple Depth-First Partition

Figure 89.  Associative Memory 7 LP Total Messages Sent vs. LP Output Lines

Figure 90.  Associative Memory 8 LP Total Messages Sent vs. LP Output Lines